Allow selection of AWS vs GCP as a source #102

Open
ahmedhosny opened this issue Jul 25, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@ahmedhosny

ahmedhosny commented Jul 25, 2024

I see that there is embedded logic that selects between them here. I am curious if one, e.g. AWS, can be enforced (similar to what can be done on the IDC website when you hit "Download Images").

Re the distribution of data across AWS and GCP, it seems like the data is not mirrored across them?

@vkt1414
Collaborator

vkt1414 commented Jul 25, 2024

@ahmedhosny Thank you for creating this issue.

Yes, we can certainly provide this option to the user. When I implemented this, I did not think very hard about the AWS-to-GCS bucket mapping. Having now written a simple query to check that mapping, I can confirm that, given any AWS or GCS URL, we can predict which bucket on the other side holds the same data.

Currently our index contains only AWS URLs. That was a deliberate choice, because we can track how much data is downloaded from AWS, whereas we have no way to track GCS downloads. Regardless, with the bucket mapping below we can derive the GCS URLs from the AWS URLs. And we do mirror the data, i.e., the AWS and GCS buckets are identical clones.

This is a relatively easy feature to implement, and we'll keep you posted.

[Image: AWS-to-GCS bucket mapping]
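
The URL translation described in this comment could be sketched roughly as follows. This is only an illustration: the bucket names in `AWS_TO_GCS_BUCKET` are hypothetical placeholders (the actual mapping is the one in the image above), and `aws_url_to_gcs_url` is a made-up helper name, not part of idc-index. It relies on the statement that the buckets are identical clones, so only the bucket name changes and the object key stays the same.

```python
# Hypothetical bucket names for illustration only; substitute the real
# AWS-to-GCS mapping from the table shown in the comment.
AWS_TO_GCS_BUCKET = {
    "example-aws-bucket-a": "example-gcs-bucket-a",
    "example-aws-bucket-b": "example-gcs-bucket-b",
}


def aws_url_to_gcs_url(aws_url: str, mapping: dict = AWS_TO_GCS_BUCKET) -> str:
    """Translate an s3:// URL into the gs:// URL of the mirrored object."""
    prefix = "s3://"
    if not aws_url.startswith(prefix):
        raise ValueError(f"expected an s3:// URL, got {aws_url!r}")

    # Split "bucket/key/with/slashes" into the bucket and the object key.
    bucket, _, key = aws_url[len(prefix):].partition("/")
    if bucket not in mapping:
        raise ValueError(f"no GCS mirror known for bucket {bucket!r}")

    # The buckets are described as identical clones, so the key is unchanged.
    return f"gs://{mapping[bucket]}/{key}"
```

For example, `aws_url_to_gcs_url("s3://example-aws-bucket-a/some-series/file.dcm")` would yield the same key under `example-gcs-bucket-a`.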

@fedorov
Member

fedorov commented Jul 31, 2024

@ahmedhosny could you elaborate on the background for your suggestion?

I am interested in whether the current AWS-only download support has any significant consequences. I assumed it does not, since data egress is free and I would expect download performance to be similar from AWS and GCP. The only potential downside I can think of is if one wants to use GCP-native tools for the download. Is there anything else we missed?

@fedorov fedorov added the enhancement New feature or request label Jul 31, 2024
@ahmedhosny
Author

@fedorov We are running s5cmd commands from an AWS EC2 instance. Some basic testing showed that downloading from AWS to AWS was ~30% faster than GCP to AWS. Not sure if there are some "within system" benefits?

@vkt1414 Thank you for the background there. We added the aws_bucket column to our query.

P.S. Very impressed with IDC and the maturity it has reached. Great work 👏

@fedorov
Member

fedorov commented Jul 31, 2024

@ahmedhosny thank you for the clarification!

Note, however, that you can use idc-index directly to download from IDC, and it will be downloading from AWS by default.

It is as simple as the following:

  1. install: pip install --upgrade idc-index
  2. download NLST collection: idc download nlst
  3. download specific patient: idc download <patient_id> (same idea for study/series)
  4. download content defined by a manifest created using IDC Portal: idc download <manifest_file>

Behind the scenes, pip install will also install s5cmd, and the idc download command will use s5cmd to fetch the files from the AWS buckets while organizing them into a collection/patient/study/series folder hierarchy.
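
To make the source selection discussed in this issue concrete, here is a minimal sketch of how an s5cmd invocation could be assembled for either provider. It is not how idc-index is implemented: `build_s5cmd_command`, its parameters, and the assumption that a series lives under a single key prefix in the bucket are all hypothetical. The only pieces taken as given are s5cmd's `--no-sign-request` and `--endpoint-url` options and the public S3-compatible endpoints of the two providers.

```python
from pathlib import PurePosixPath

# S3-compatible endpoints; s5cmd addresses both providers with s3:// URLs.
ENDPOINTS = {
    "aws": "https://s3.amazonaws.com",
    "gcp": "https://storage.googleapis.com",
}


def build_s5cmd_command(source, bucket, series_prefix, dest_root,
                        collection, patient, study, series):
    """Return the argv list for copying one series from the chosen provider.

    All arguments are illustrative: `bucket` and `series_prefix` would come
    from an index or query result, and the destination mirrors the
    collection/patient/study/series hierarchy mentioned above.
    """
    if source not in ENDPOINTS:
        raise ValueError(f"source must be one of {sorted(ENDPOINTS)}")

    dest = PurePosixPath(dest_root) / collection / patient / study / series
    return [
        "s5cmd",
        "--no-sign-request",               # public buckets, no credentials
        "--endpoint-url", ENDPOINTS[source],
        "cp",
        f"s3://{bucket}/{series_prefix}/*",
        f"{dest}/",                        # trailing slash: copy into a directory
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where s5cmd is installed; switching `source` between `"aws"` and `"gcp"` changes only the endpoint, and the bucket name would need to follow the mapping for the chosen provider.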

Happy to jump on a call to guide you and get your feedback!

Sorry, documentation is behind ....

@ahmedhosny
Author

Thanks @fedorov! Yes, we plan to switch to idc-index now. It is much cleaner than managing massive JSON files with BQ query results.
