Allow selection of AWS vs GCP as a source #102

ahmedhosny · 2024-07-25T01:37:26Z

I see that there is embedded logic that selects between them here. I am curious if one, e.g. AWS, can be enforced (similar to what can be done on the IDC website when you hit "Download Images").

Re the distribution of data across AWS and GCP, it seems like the data is not mirrored across them?

vkt1414 · 2024-07-25T02:59:14Z

@ahmedhosny Thank you for creating this issue.

Yes, we can certainly provide this option to the user. When I implemented this, I did not think very hard about the AWS-to-GCS bucket mapping. Having now written a simple query to check that mapping, I can confirm that, given any AWS or GCS URL, we can predict which bucket on the other side will hold the data.

Currently our index has only AWS URLs. This was a deliberate choice, since it lets us track how much data is being downloaded from AWS, whereas we have no way to track GCS downloads. Regardless, with the below bucket mapping, we can predict the GCS URLs from the AWS URLs. And we do mirror the data, i.e. the AWS and GCS buckets are identical clones.

This is a relatively easy feature to implement, and we'll keep you posted.
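As a rough illustration of the URL translation described above, here is a minimal sketch. The bucket names in `AWS_TO_GCS_BUCKET` and the function name `aws_to_gcs_url` are hypothetical placeholders, since the actual mapping is not reproduced in this thread. The sketch relies only on the stated fact that the buckets are identical clones, so the object path is assumed to carry over unchanged.

```python
# Hypothetical bucket mapping: these names are placeholders, NOT the real IDC buckets.
AWS_TO_GCS_BUCKET = {
    "example-aws-open": "example-gcs-open",
    "example-aws-restricted": "example-gcs-restricted",
}


def aws_to_gcs_url(aws_url: str) -> str:
    """Translate an s3:// URL into the gs:// URL of the mirrored bucket."""
    prefix = "s3://"
    if not aws_url.startswith(prefix):
        raise ValueError(f"not an s3:// URL: {aws_url!r}")
    # Split "bucket/key/..." into the bucket name and the object key.
    bucket, _, key = aws_url[len(prefix):].partition("/")
    if bucket not in AWS_TO_GCS_BUCKET:
        raise ValueError(f"no GCS mirror known for bucket {bucket!r}")
    # Buckets are assumed to be identical clones, so the key is kept as-is.
    return f"gs://{AWS_TO_GCS_BUCKET[bucket]}/{key}"


if __name__ == "__main__":
    print(aws_to_gcs_url("s3://example-aws-open/series-uuid/file.dcm"))
```

Swapping the dictionary for the real mapping (and inverting it for the GCS-to-AWS direction) would be the only change needed if the assumption about identical paths holds.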

fedorov · 2024-07-31T14:21:40Z

@ahmedhosny could you elaborate on the background for your suggestion?

I am interested in whether the current support for download from AWS only has any significant consequences. I assumed it does not, since data egress is free, and I would expect download performance to be similar from AWS and GCP. The only potential downside I could think of is if one wants to use GCP-native tools for download. Is there anything else we missed?

ahmedhosny · 2024-07-31T22:42:28Z

@fedorov We are running s5cmd commands from an AWS EC2 instance. Some basic testing showed that downloading from AWS to AWS was ~30% faster than GCP to AWS. Not sure if there are some "within system" benefits?
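For anyone wanting to repeat this kind of comparison, here is a minimal sketch of building the two s5cmd invocations. It assumes anonymous access via `--no-sign-request` and selects the provider with `--endpoint-url`. The helper name `s5cmd_command`, the bucket names, and the destination directory are hypothetical. The sketch only constructs the commands and does not run them.

```python
import shlex

# S3-compatible endpoints for each provider.
ENDPOINTS = {
    "aws": "https://s3.amazonaws.com",
    "gcp": "https://storage.googleapis.com",
}


def s5cmd_command(source: str, bucket_url: str, dest_dir: str) -> list[str]:
    """Build an s5cmd argv that copies bucket_url to dest_dir from the given provider."""
    if source not in ENDPOINTS:
        raise ValueError(f"unknown source {source!r}; expected one of {sorted(ENDPOINTS)}")
    return [
        "s5cmd",
        "--no-sign-request",                    # anonymous access to a public bucket
        "--endpoint-url", ENDPOINTS[source],    # selects AWS or GCP
        "cp", bucket_url, dest_dir,
    ]


if __name__ == "__main__":
    # Bucket names below are placeholders, not real IDC buckets.
    for source, bucket in [("aws", "example-aws-open"), ("gcp", "example-gcs-open")]:
        cmd = s5cmd_command(source, f"s3://{bucket}/series-uuid/*", "./downloads/")
        print(shlex.join(cmd))
```

Timing each printed command on the same EC2 instance (for example with `time`) would give a like-for-like comparison of the two sources.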

@vkt1414 Thank you for the background there. We added the aws_bucket column to our query.

P.S. Very impressed with IDC and the maturity it has reached. Great work 👏

fedorov · 2024-07-31T22:55:25Z

@ahmedhosny thank you for the clarification!

Note, however, that you can use idc-index directly to download from IDC, and it will be downloading from AWS by default.

It is as simple as the following:

install: pip install --upgrade idc-index
download NLST collection: idc download nlst
download specific patient: idc download <patient_id> (same idea for study/series)
download content defined by a manifest created using IDC Portal: idc download <manifest_file>

Behind the scenes, pip install will install s5cmd, and the idc download command will use s5cmd to fetch the files from the AWS buckets, while also organizing them into a collection/patient/study/series folder hierarchy.
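To make the resulting layout concrete, here is a minimal sketch of how a destination path in a collection/patient/study/series hierarchy could be assembled. The function name `series_dir` and the example identifiers are hypothetical, and the exact directory naming used by idc-index may differ.

```python
from pathlib import Path


def series_dir(root: str, collection_id: str, patient_id: str,
               study_uid: str, series_uid: str) -> Path:
    """Return the folder for one series under root, nested by collection/patient/study."""
    return Path(root) / collection_id / patient_id / study_uid / series_uid


if __name__ == "__main__":
    # All identifiers below are made-up examples.
    target = series_dir("downloads", "nlst", "patient-001", "1.2.3", "1.2.3.4")
    print(target.as_posix())
```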

Happy to jump on a call to guide you and get your feedback!

Sorry, documentation is behind ....

ahmedhosny · 2024-07-31T23:19:05Z

Thanks @fedorov! Yes, we plan to switch to idc-index now - much cleaner than managing massive json files with BQ query results.

fedorov added the enhancement label · 2024-07-31
