
Add ability to read from cloud storage #22

Closed
hammer opened this issue Dec 4, 2023 · 16 comments
Labels
enhancement New feature or request

Comments

hammer commented Dec 4, 2023

Perhaps with the object_store crate?

@CarlKCarlK CarlKCarlK added the enhancement New feature or request label Dec 4, 2023
@CarlKCarlK (Contributor)

Very interesting idea. Thanks!

Object_store's vectored_read seems especially on-point.
This would be most useful if Python users of bed-reader could use it, too. I wonder if there are any examples of Python packages that use object_store under the covers.

Timeline: I should be able to take a closer look in a few weeks.
(Also, either here or privately, please let me know how useful/important this would be to you/your project/your users.)

hammer (Author) commented Dec 4, 2023

https://github.com/roeap/object-store-python seems to be a Python interface.

Thanks for the consideration Carl!

hammer (Author) commented Dec 4, 2023

Also, some details on our use case: we are finally writing up a publication on sgkit at https://github.com/pystatgen/sgkit-publication. I'm hoping to demonstrate how to run a GWAS using a large synthetic dataset called HAPNEST: sgkit-dev/sgkit-publication#43. The BED files for the largest chromosomes are over 100 GB, and I'm hoping to read them in from cloud storage rather than copy them to a local filesystem to read. I suppose I could explore ways of mounting cloud storage on the local filesystem, but it would be nicer I think to just use the cloud storage APIs themselves.

@CarlKCarlK (Contributor)

A request for more info to allow eventual performance testing ...
So, the largest BED file is 1 million samples by ??? variants. Would a typical read be all the samples x some slice of variants? If so, how many variants at a time would be typical?
Which cloud provider would be your first choice? (I'll need to test on one, and it might as well be the same as yours. I'm especially interested in, and worried about, seeing the vectored reads work efficiently.)
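For context on why variant slices map well onto vectored/ranged cloud reads: PLINK .bed files are variant-major, with a 3-byte magic header followed by 2-bit genotypes packed 4 samples per byte, so each variant occupies a fixed, computable byte range. A minimal sketch of that arithmetic (helper name hypothetical):

```python
def bed_variant_byte_range(iid_count: int, variant_index: int) -> tuple[int, int]:
    """Byte range of one variant in a PLINK .bed file.

    .bed files are variant-major: after a 3-byte magic header, each
    variant stores 2-bit genotypes packed 4 samples per byte.
    """
    bytes_per_variant = (iid_count + 3) // 4
    start = 3 + variant_index * bytes_per_variant
    return start, start + bytes_per_variant

# With 1 million samples, each variant is a single 250,000-byte ranged read.
print(bed_variant_byte_range(1_000_000, 0))  # (3, 250003)
```

So "all samples x a slice of variants" translates into one contiguous (or a small batch of) range requests per variant, which is exactly the access pattern vectored reads are meant to coalesce.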

hammer (Author) commented Dec 7, 2023

1 million samples by 6.8 million variants. Our use case is to read the entire BED file and write it back out in the Zarr storage format. Currently we don't do anything intelligent to read and write in blocks, unfortunately, but that's something we should probably be able to add in our library, as we do when reading large VCF files.
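The block-wise reading mentioned here boils down to iterating over variant slices of a fixed width; a minimal sketch of the chunking arithmetic (function name hypothetical):

```python
def variant_blocks(sid_count: int, block_size: int):
    """Yield (start, stop) variant-index slices covering the whole file."""
    for start in range(0, sid_count, block_size):
        yield start, min(start + block_size, sid_count)

# 6.8 million variants in 1-million-variant blocks -> 7 slices,
# the last covering the 800,000-variant remainder.
blocks = list(variant_blocks(6_800_000, 1_000_000))
print(len(blocks), blocks[-1])  # 7 (6000000, 6800000)
```

Each slice could then be read from cloud storage and appended to the Zarr store without ever holding the full matrix in memory.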

hammer (Author) commented Dec 7, 2023

Oh and for first choice cloud provider I'm currently using Google, but I'd say Amazon is the most common and I'd be happy to switch to them if you prefer.

CarlKCarlK (Contributor) commented Dec 7, 2023 via email

@CarlKCarlK (Contributor)

I'm making progress. Below is some (working) example code.

Current API: a Bed struct (that contains a file path).
Proposed additional API:

  • a BedCloud struct (that contains an object store and a store path)
  • anything that might involve "reading" must be .awaited
  • you can control the number of concurrent async calls to the cloud
  • you can control the size of the buffer (the number of SNPs in each async call to the cloud)

    let object_store = Arc::new(LocalFileSystem::new());
    let file_path = sample_bed_file("plink_sim_10s_100v_10pmiss.bed")?;
    let store_path = StorePath::from_filesystem_path(file_path).map_err(BedErrorPlus::from)?;

    let mut bed_cloud = BedCloud::new((object_store, store_path)).await?;
    let val = bed_cloud.read::<i8>().await?;
    println!("{val:?}");

Another example:

    let object_path = sample_bed_object_path("plink_sim_10s_100v_10pmiss.bed")?;

    let mut bed_cloud = BedCloud::new(object_path).await?;
    let val = ReadOptions::builder()
        .count_a2()
        .i8()
        .read_cloud(&mut bed_cloud)
        .await?;

    let mean = val.mapv(|elem| elem as f64).mean().unwrap();
    println!("{mean:?}");
    assert!(mean == -13.274); // really shouldn't do mean on data where -127 represents missing

@hammer Let me know if this looks like it would work well with your use case.

hammer (Author) commented Dec 21, 2023

Thanks Carl! Let me cc @tomwhite as I believe he wrote the BED reader code in sgkit. I assume you'd wrap these Rust interfaces with a Python interface in your library?

@CarlKCarlK (Contributor)

Yeah, I kind of forgot that everyone (including my other projects) uses the library from Python and not from Rust. :-)

My next steps:

  • (Holiday stuff)
  • Get it working well from AWS, not just object_store's LocalFileSystem
  • Create the Python interface
  • Get Write (not just Read) working.

@hammer or @tomwhite, if you have any or know of any large non-private PLINK Bed files on AWS (or Google if necessary) that I can use for performance testing, please let me know.

hammer (Author) commented Jan 11, 2024

@CarlKCarlK Happy new year! I've been using the files from https://www.ebi.ac.uk/biostudies/studies/S-BSST936 for testing. I have all of the smaller files (600 samples) and chr20 for the 1,000,000 samples on Google Cloud Storage if that would be helpful. It took me about 30 minutes to get chr20 off of FTP and into the cloud, so it may just be easier to do yourself too.

@CarlKCarlK (Contributor)

@hammer & @tomwhite

Please see the end of this note for something for you to try.

Meanwhile, some observations. I'm finding this feature somewhat frustrating. Let me share some of those frustrations and perhaps you can make suggestions (or just offer sympathy 😊).

  • I would like my documentation to offer my users working examples, but as far as I can tell all the cloud providers require authentication even for "public" data, making simple examples impossible to offer.
  • I depend on the Rust version of object_store. It is OK and I'm using it extensively under the covers. However, its documentation is very bare-bones (perhaps in part because of the first point above).
  • The Python version of object_store is more limited, so instead of using it, I'm currently just having Python pass URLs and option strings. I hope that is OK with you and other future users.
  • Sadly, I can't find any good documentation on creating cloud access URLs and option strings. This puts me in the position of telling my users "just create a URL for the cloud access" without being able to point them to any good instructions on how to create such a URL.
  • I'd like to test and tune the performance of downloading parts of big files from the cloud. However, I'm afraid to host the data myself for fear of an unexpected bill. (Last year, I ran up a $200 bill on AWS when testing the Mac M2 version of bed-reader -- I begged, and AWS kindly forgave it.)
  • This feature increases the size of the bed-reader download from 1.5 MB to 7.5 MB. I need to investigate why. Maybe I misconfigured something, or maybe this is just the cost of adding cloud support.
  • For a while, I thought I could expose async on the Python side. Instead, I now offer only a regular (non-async) interface, which is much, much simpler for users. Under the covers, the Rust code does use and offer async. I hope no Python users need direct async access.
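The sync-over-async design in the last bullet can be sketched in Python terms (a generic illustration of the pattern, not bed-reader's actual implementation, which lives on the Rust side; all names here are hypothetical):

```python
import asyncio

async def _fetch_ranges_async(url: str, ranges: list[tuple[int, int]]) -> list[bytes]:
    # Placeholder for real concurrent cloud fetches (hypothetical).
    await asyncio.sleep(0)
    return [b"\0" * (stop - start) for start, stop in ranges]

def fetch_ranges(url: str, ranges: list[tuple[int, int]]) -> list[bytes]:
    """Blocking facade: callers use a plain function; async stays hidden inside."""
    return asyncio.run(_fetch_ranges_async(url, ranges))

chunks = fetch_ranges("s3://bucket/file.bed", [(3, 253), (253, 503)])
print([len(c) for c in chunks])  # [250, 250]
```

The trade-off is that a blocking facade can't be called from inside an already-running event loop, which is presumably acceptable for typical notebook and script users.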

None of this means the feature isn't worth adding. I think the state of cloud/async/etc. is a bit more primitive than I would have thought.

  • Carl

============
Please try this beta version with cloud support:

pip install bed-reader[samples,sparse]==1.0.1b1

The documentation is here: https://fastlmm.github.io/bed-reader/1.0.1beta/

Here is sample usage:

import numpy as np
from bed_reader import open_bed

# Somehow, get your AWS credentials
import configparser, os  
config = configparser.ConfigParser()  
_ = config.read(os.path.expanduser("~/.aws/credentials"))  

# Create a dictionary with your AWS credentials and the AWS region.
cloud_options = {  
    "aws_access_key_id": config["default"].get("aws_access_key_id"),  
    "aws_secret_access_key": config["default"].get("aws_secret_access_key"),  
    "aws_region": "us-west-2"}  

# Open the bed file with a URL and any needed cloud options, then use as before.
with open_bed("s3://bedreader/v1/toydata.5chrom.bed", cloud_options=cloud_options) as bed:  
    val = bed.read(np.s_[:10, :10])  
val

hammer (Author) commented Jan 11, 2024

Thanks Carl I'll check it out! To your point about the state of cloud/async being more primitive than expected, we've found the same thing. As one example the primary AWS Python library does not support asyncio: boto/botocore#458.

@CarlKCarlK (Contributor)

I'm very excited because bed-reader can now efficiently read from regular web servers. This lets us, for example, read a SNP almost instantly, directly from the S-BSST936 website!

import numpy as np
from bed_reader import open_bed
with open_bed(
    "https://www.ebi.ac.uk/biostudies/files/S-BSST936/genotypes/synthetic_v1_chr-10.bed",
    skip_format_check=True,
    iid_count=1_008_000,
    sid_count=361_561,
    ) as bed:
    val = bed.read(index=np.s_[:, 100_000], dtype=np.float32)
    np.mean(val) 
# outputs 0.033913...
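With skip_format_check=True the dimensions must be supplied by the caller, and one way to sanity-check them is that they determine the exact .bed file size (3-byte header, 2 bits per genotype, 4 samples per byte; helper name hypothetical):

```python
def expected_bed_size(iid_count: int, sid_count: int) -> int:
    """Exact size in bytes of a PLINK .bed file with these dimensions."""
    return 3 + sid_count * ((iid_count + 3) // 4)

# The chr-10 file above should be roughly 91 GB:
print(expected_bed_size(1_008_000, 361_561))  # 91113372003
```

Comparing this number against the Content-Length reported by the server is a cheap way to catch a wrong iid_count or sid_count before reading.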

More examples here: https://fastlmm.github.io/bed-reader/1.0.1beta/cloud_urls.html

Please install with
pip install bed-reader[samples,sparse]==1.0.1b2

The full documentation is here: https://fastlmm.github.io/bed-reader/1.0.1beta/

  • Carl

hammer (Author) commented Jan 22, 2024

Thanks Carl! I'm on vacation this week but am excited to try this out next week when I'm back.

@CarlKCarlK (Contributor)

Support for cloud files is now released as version 1.0.2 on PyPI. Thanks for your suggestion!

(I also used this as an excuse to write an article about adding this feature to the Rust side of the code. It will hopefully soon be published on https://medium.com/@carlmkadie)
