Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support remote repository urls #32

Merged
merged 20 commits into from
Jan 9, 2023
Merged

Support remote repository urls #32

merged 20 commits into from
Jan 9, 2023

Conversation

multimeric
Copy link
Collaborator

@multimeric multimeric commented Dec 16, 2022

Closes #30.

User-Facing Changes

  • get_SingleCellExperiment
    • repository argument is now the HTTP URL of a remote data source, with a default value pointing to AWS.
    • Added cache_directory argument, which points to a local cache.
    • Multiple assays can now be specified. If they are, the resulting SCE will have multiple assays available.
    • Added progress bars and messages:
      image
    • Removed the default local path for the repository argument, since this was WEHI-specific
    • The "counts_per_millon" assay has been renamed to "cpm", according to good Bioconductor practise: https://rdrr.io/bioc/SingleCellExperiment/man/assays.html
    • The .data argument has been renamed to data according to Tidyverse style: https://design.tidyverse.org/dots-prefix.html
    • The genes argument has been renamed to features
    • The assay argument has been renamed to assays
  • get_metadata
    • Now has repository and cache_directory argument that work similarly to get_SingleCellExperiment

Internal Changes

  • Rewrite most of get_SingleCellExperiment to have tidier code
  • Added sync_remote_file, sync_assay_files, and get_default_cache_dir, which are utility functions used for this feature
  • Added tests for these new functions and arguments

TODO

  • Add back in assay argument, which accepts a string
  • Move data into subdirectories of the same bucket
  • Skip the download if the files are already present
  • Add tests for the multi assay behaviour
  • Clean up unused imports

@stemangiola
Copy link
Owner

  • Removed assay argument from get_SingleCellExperiment

For the future, production package the remote URL will be a default argument, and the assay argument will become relevant again.

We can imagine the users (90%) query from remote, never touching the repository or cache argument, but just interacting with the assay and gene arguments.

@multimeric
Copy link
Collaborator Author

multimeric commented Dec 16, 2022

For the future, production package the remote URL will be a default argument

Agreed

and the assay argument will become relevant again.

I find this argument a bit weird because it only makes sense if the user is using the default repository. What does it mean if the user uses get_SingleCellExperiment(repository="/some/path/db2_scaled", assay="counts")?

Could we maybe have repository = raw_counts() or repository = scaled_counts(), where both raw_counts and scaled_counts are functions that return a remote URL? Because then we don't need to add an assay argument.

@stemangiola
Copy link
Owner

For the future, production package the remote URL will be a default argument

Agreed

and the assay argument will become relevant again.

I find this argument a bit weird because it only makes sense if the user is using the default repository. What does it mean if the user uses get_SingleCellExperiment(repository="/some/path/db2_scaled", assay="counts")?

Could we maybe have repository = raw_counts() or repository = scaled_counts(), where both raw_counts and scaled_counts are functions that return a remote URL? Because then we don't need to add an assay argument.

I see.

OK, (I am thinking to a leaner solution) how about for the future, the only arguments available will be

main

  • assay
  • gene

secondary

  • cache_directory

If a user institute downloads the local copy of our database, cache_directory will be set to that local location by the user and will work as cache. If cache won't be specified, we will set the cache and fill it with remote downloads.

I am thinking in the future, repository argument will never be needed.
cache_directory

@multimeric
Copy link
Collaborator Author

That works, although it will lose the flexibility of having a configurable URL. If we follow my raw_counts() API then it's potentially flexible if we ever e.g. provide another copy in another place or using another protocol (other than HTTP).

Also, if you want assay to stay, how should it work now, when we don't have the remote URL set as a default?

@stemangiola
Copy link
Owner

That works, although it will lose the flexibility of having a configurable URL. If we follow my raw_counts() API then it's potentially flexible if we ever e.g. provide another copy in another place or using another protocol (other than HTTP).

I think we need a brief chat at some point.

Also, if you want assay to stay, how should it work now when we don't have the remote URL set as a default?

Let's forget about assay for the moment.

@multimeric
Copy link
Collaborator Author

Should I support requesting both assays? ie get_SingleCellExperiment(assay=c("raw", "scaled"))?

@stemangiola
Copy link
Owner

stemangiola commented Dec 19, 2022

get_SingleCellExperiment(assay=c("raw", "scaled"))

[EDIT] well thinking about this, yes (if it is not incredibly complex).

Copy link
Owner

@stemangiola stemangiola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @multimeric ! The only change I would ask is argument names

  • cache_dir -> cache_directory (no abbreviation policy, which I also try to use in the back end)
  • genes -> feature (singular policy for arguments, and feature respects more the standards)

Well done with the unit tests!

@multimeric
Copy link
Collaborator Author

Hmm as a user I find it a bit unintuitive to say feature = c("some_gene", "other_gene") but if you think that's a good practise then I'm happy to follow it.

@stemangiola
Copy link
Owner

Hmm as a user I find it a bit unintuitive to say feature = c("some_gene", "other_gene") but if you think that's a good practise then I'm happy to follow it.

Oki we can use features

@multimeric
Copy link
Collaborator Author

My justification is functions like SingleCellExperiment which have plural arguments like assays, reducedDims.

@multimeric
Copy link
Collaborator Author

The CI isn't working, but here is the lint outputs from running it locally:

── R CMD check results ───────────────────────────────────── HCAquery 0.1.0 ────
Duration: 11m 14.9s

0 errors ✔ | 0 warnings ✔ | 0 notes ✔
> BiocCheck::BiocCheck()
─ BiocCheckVersion: 1.32.1
─ BiocVersion: 3.15
─ Package: HCAquery
─ PackageVersion: 0.1.0
─ sourceDir: /stornext/Home/data/allstaff/m/milton.m/HCAquery
─ installDir: /vast/scratch/users/milton.m/tmp/RtmpCvJCIh/file10bc131287bc
─ BiocCheckDir: /stornext/Home/data/allstaff/m/milton.m/HCAquery.BiocCheck
─ platform: unix
─ isTarBall: FALSE

* Installing package...
* Checking package dependencies...
* Checking if other packages can import this one...
* Checking to see if we understand object initialization...
* Checking for deprecated package usage...
* Checking for remote package usage...
* Checking for 'LazyData: true' usage...
* Checking version number...
* Checking version number validity...
    Package version 0.1.0; pre-release
* Checking R version dependency...
* Checking package size...
      Skipped... only checked on source tarball
* Checking individual file sizes...
* Checking biocViews...
* Checking that biocViews are present...
* Checking package type based on biocViews...
    Software
* Checking for non-trivial biocViews...
* Checking that biocViews come from the same category...
* Checking biocViews validity...
* Checking for recommended biocViews...
* Checking build system compatibility...
* Checking for blank lines in DESCRIPTION...
* Checking if DESCRIPTION is well formatted...
* Checking for proper Description: field...
* Checking for whitespace in DESCRIPTION field names...
* Checking that Package field matches directory/tarball name...
* Checking for Version field...
* Checking for valid maintainer...
* Checking License: for restrictive use...
* Checking for pinned package versions...
* Checking DESCRIPTION/NAMESPACE consistency...
* Checking .Rbuildignore...
* Checking for stray BiocCheck output folders...
* Checking vignette directory...
    * NOTE: Potential intermediate files found:
    * NOTE: 'sessionInfo' not found in vignette(s)
* Checking package installation calls in R code...
* Checking for library/require of HCAquery...
* Checking coding practice...
* Checking parsed R code in R directory, examples, vignettes...
* Checking function lengths...
    * NOTE: Recommended function length <= 50 lines.There are 1
      functions > 50 lines.
* Checking man page documentation...
* Checking package NEWS...
* Checking unit tests...
* Checking skip_on_bioc() in tests...
* Checking formatting of DESCRIPTION, NAMESPACE, man pages, R source,
  and vignette source...
    * NOTE: Consider shorter lines; 12 lines (2%) are > 80 characters
      long.
    * NOTE: Consider multiples of 4 spaces for line indents; 26 lines
      (4%) are not.
    See http://bioconductor.org/developers/how-to/coding-style/
    See styler package: https://cran.r-project.org/package=styler as
      described in the BiocCheck vignette.
* Checking if package already exists in CRAN...
'getOption("repos")' replaces Bioconductor standard repositories, see
'?repositories' for details

replacement repositories:
    CRAN: https://cran.rstudio.com/

* Checking for bioc-devel mailing list subscription...
    * NOTE: Cannot determine whether maintainer is subscribed to the
      Bioc-Devel mailing list (requires admin credentials). Subscribe
      here: https://stat.ethz.ch/mailman/listinfo/bioc-devel
* Checking for support site registration...
    Maintainer is registered at support site.
    * ERROR: Maintainer must add package name to Watched Tags on the
      support site; Edit your Support Site User Profile to add Watched
      Tags.

─ BiocCheck results ──
1 ERRORS | 0 WARNINGS | 6 NOTES

@multimeric multimeric merged commit a168755 into master Jan 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AWS Database
2 participants