AWS Database #30

multimeric · 2022-11-23T05:36:35Z

Upload database to S3
Update R package to support loading from S3

multimeric · 2022-12-14T01:58:56Z

@stemangiola what do you think about this API:

get_SingleCellExperiment = function(
  .data,
  repository  = "~/hca_harmonized",
  remote_source = NULL,
  genes = NULL
)

The default behaviour would be to load the data from the repository path: the same behaviour as currently. However, when a file is requested that isn't present, and remote_source is not NULL (e.g. an S3 bucket s3://harmonised-human-atlas-original), then it will first download the necessary files from AWS, store them in repository, and then load them as normal.

We would then need to remove the WEHI default value for repository, and set it to their home directory by default. At WEHI we would then just tell everyone to use repository = "/vast/projects/RCP/human_cell_atlas/splitted_DB2_data", and they would never need to download anything.

I would also like to remove the assay argument since it's a bit difficult to reason with, considering it's just a shortcut to setting the repository. But I can keep it if necessary.

stemangiola · 2022-12-14T02:29:27Z

Can repository and remote_source be just one argument?

For example,

If repository is remote (URL?) use cache system
if not read from local.

multimeric · 2022-12-14T02:38:33Z

I thought about that, but the issue is that users need to be able to specify where the cache lives on their local filesystem. I could instead add a cache_dir which is only used when repository points to a remote data source, if you would prefer. But this still involves two parameters.

stemangiola · 2022-12-14T02:49:06Z

we could have the cache_directory as optional argument,

if NULL

cache will be a temporary directory by default for remote
the same as repository for local

if NOt NULL

cache will be that directory for remote and local, means that the user wants a private copy of the data (if this acceptable)?

multimeric · 2022-12-14T02:51:47Z

You think we should copy the data even if it exists locally? I don't know if most users would want that. e.g. when they get_SingleCellExperiment(repository="/some/path", cache="/some/other/path")

Also for the default cache, I was thinking home directory so that it's persistent. Would you rather a temporary directory?

stemangiola · 2022-12-14T02:54:41Z

You think we should copy the data even if it exists locally? I don't know if most users would want that.

by default not. Almost nobody will touch the cache parameter. If someone sets it's own cache is because they know what they are doing (probably) and it is an active decision (of course everything will be documented).

Also for the default cache, I was thinking home directory so that it's persistent. Would you rather a temporary directory?

Yes home make sense, but at WEHI for example home has a limited space, but you and Julie are for sure the best people to decide this.

multimeric · 2022-12-14T02:57:08Z

WEHI is a bit of a weird case, but I want this to work well for other institutes too. If I download to ~/.cache/hca_harmonised by default it should work everywhere, and some WEHI users (such as myself) have already symlinked this to VAST scratch.

stemangiola · 2022-12-14T03:00:53Z

All this could be even coded in the parameter argument to give transparency, for example

get_SingleCellExperiment = function(
repository,
cache_directory = repository |> is_url() |> ifelse( file.path(R.home, ".cache/hca_harmonised"), repository) # Default behaviour
)

multimeric · 2022-12-15T03:37:21Z

by default not. Almost nobody will touch the cache parameter. If someone sets it's own cache is because they know what they are doing (probably) and it is an active decision (of course everything will be documented).

So the issue is that we need the cache argument to have a default, because if the user is downloading from AWS, we always want it to be cached. However, this makes it difficult if the user specifies a local repository, because it will mean we duplicate that repository by default.

Either we ignore the local cache when the repository is a local URL, or we go back to my previous suggested API with remote_source. Which would you prefer?

stemangiola · 2022-12-15T03:42:21Z

Either we ignore the local cache when the repository is a local URL

I would prefer this.

In this case we can simply do

get_SingleCellExperiment = function(
repository,
cache_directory = file.path(R.home, ".cache/hca_harmonised") 
)

And document that this argument is used only for remote repository.

To create the cache directory if doesn't exist you can

cache_directory |> dirname() |> dir.create(showWarnings = FALSE, recursive = TRUE)

multimeric self-assigned this Nov 23, 2022

multimeric mentioned this issue Dec 16, 2022

Support remote repository urls #32

Merged

5 tasks

multimeric closed this as completed in #32 Jan 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AWS Database #30

AWS Database #30

multimeric commented Nov 23, 2022 •

edited

Loading

multimeric commented Dec 14, 2022

stemangiola commented Dec 14, 2022

multimeric commented Dec 14, 2022

stemangiola commented Dec 14, 2022 •

edited

Loading

multimeric commented Dec 14, 2022 •

edited

Loading

stemangiola commented Dec 14, 2022

multimeric commented Dec 14, 2022

stemangiola commented Dec 14, 2022 •

edited

Loading

multimeric commented Dec 15, 2022

stemangiola commented Dec 15, 2022 •

edited

Loading

AWS Database #30

AWS Database #30

Comments

multimeric commented Nov 23, 2022 • edited Loading

multimeric commented Dec 14, 2022

stemangiola commented Dec 14, 2022

multimeric commented Dec 14, 2022

stemangiola commented Dec 14, 2022 • edited Loading

multimeric commented Dec 14, 2022 • edited Loading

stemangiola commented Dec 14, 2022

multimeric commented Dec 14, 2022

stemangiola commented Dec 14, 2022 • edited Loading

multimeric commented Dec 15, 2022

stemangiola commented Dec 15, 2022 • edited Loading

multimeric commented Nov 23, 2022 •

edited

Loading

stemangiola commented Dec 14, 2022 •

edited

Loading

multimeric commented Dec 14, 2022 •

edited

Loading

stemangiola commented Dec 14, 2022 •

edited

Loading

stemangiola commented Dec 15, 2022 •

edited

Loading