Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS Database #30

Closed
1 of 2 tasks
multimeric opened this issue Nov 23, 2022 · 10 comments · Fixed by #32
Closed
1 of 2 tasks

AWS Database #30

multimeric opened this issue Nov 23, 2022 · 10 comments · Fixed by #32
Assignees

Comments

@multimeric
Copy link
Collaborator

multimeric commented Nov 23, 2022

  • Upload database to S3
  • Update R package to support loading from S3
@multimeric multimeric self-assigned this Nov 23, 2022
@multimeric
Copy link
Collaborator Author

@stemangiola what do you think about this API:

get_SingleCellExperiment = function(
  .data,
  repository  = "~/hca_harmonized",
  remote_source = NULL,
  genes = NULL
)

The default behaviour would be to load the data from the repository path: the same behaviour as currently. However, when a file is requested that isn't present, and remote_source is not NULL (e.g. an S3 bucket s3://harmonised-human-atlas-original), then it will first download the necessary files from AWS, store them in repository, and then load them as normal.

We would then need to remove the WEHI default value for repository, and set it to their home directory by default. At WEHI we would then just tell everyone to use repository = "/vast/projects/RCP/human_cell_atlas/splitted_DB2_data", and they would never need to download anything.

I would also like to remove the assay argument since it's a bit difficult to reason with, considering it's just a shortcut to setting the repository. But I can keep it if necessary.

@stemangiola
Copy link
Owner

Can repository and remote_source be just one argument?

For example,

  • If repository is remote (URL?) use cache system
  • if not read from local.

@multimeric
Copy link
Collaborator Author

I thought about that, but the issue is that users need to be able to specify where the cache lives on their local filesystem. I could instead add a cache_dir which is only used when repository points to a remote data source, if you would prefer. But this still involves two parameters.

@stemangiola
Copy link
Owner

stemangiola commented Dec 14, 2022

we could have the cache_directory as optional argument,

if NULL

  • cache will be a temporary directory by default for remote
  • the same as repository for local

if NOt NULL

  • cache will be that directory for remote and local, means that the user wants a private copy of the data (if this acceptable)?

@multimeric
Copy link
Collaborator Author

multimeric commented Dec 14, 2022

You think we should copy the data even if it exists locally? I don't know if most users would want that. e.g. when they get_SingleCellExperiment(repository="/some/path", cache="/some/other/path")

Also for the default cache, I was thinking home directory so that it's persistent. Would you rather a temporary directory?

@stemangiola
Copy link
Owner

You think we should copy the data even if it exists locally? I don't know if most users would want that.

by default not. Almost nobody will touch the cache parameter. If someone sets it's own cache is because they know what they are doing (probably) and it is an active decision (of course everything will be documented).

Also for the default cache, I was thinking home directory so that it's persistent. Would you rather a temporary directory?

Yes home make sense, but at WEHI for example home has a limited space, but you and Julie are for sure the best people to decide this.

@multimeric
Copy link
Collaborator Author

WEHI is a bit of a weird case, but I want this to work well for other institutes too. If I download to ~/.cache/hca_harmonised by default it should work everywhere, and some WEHI users (such as myself) have already symlinked this to VAST scratch.

@stemangiola
Copy link
Owner

stemangiola commented Dec 14, 2022

All this could be even coded in the parameter argument to give transparency, for example

get_SingleCellExperiment = function(
repository,
cache_directory = repository |> is_url() |> ifelse( file.path(R.home, ".cache/hca_harmonised"), repository) # Default behaviour
)

@multimeric
Copy link
Collaborator Author

by default not. Almost nobody will touch the cache parameter. If someone sets it's own cache is because they know what they are doing (probably) and it is an active decision (of course everything will be documented).

So the issue is that we need the cache argument to have a default, because if the user is downloading from AWS, we always want it to be cached. However, this makes it difficult if the user specifies a local repository, because it will mean we duplicate that repository by default.

Either we ignore the local cache when the repository is a local URL, or we go back to my previous suggested API with remote_source. Which would you prefer?

@stemangiola
Copy link
Owner

stemangiola commented Dec 15, 2022

Either we ignore the local cache when the repository is a local URL

I would prefer this.

In this case we can simply do

get_SingleCellExperiment = function(
repository,
cache_directory = file.path(R.home, ".cache/hca_harmonised") 
)

And document that this argument is used only for remote repository.

To create the cache directory if doesn't exist you can

cache_directory |> dirname() |> dir.create(showWarnings = FALSE, recursive = TRUE)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants