---
title: R client for CollaboratorDB
author:
- name: Aaron Lun
email: [email protected]
package: zircon
date: "Revised: January 19, 2023"
output:
BiocStyle::html_document
vignette: >
%\VignetteIndexEntry{Using the CollaboratorDB R client}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE}
library(BiocStyle)
self <- Githubpkg("CollaboratorDB/CollaboratorDB-R", "CollaboratorDB");
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
```
# Introduction
`r self` implements a simple R client for interacting with the **CollaboratorDB** API.
**CollaboratorDB** provides a publicly accessible store for Bioconductor objects based on the [schemas here](https://github.com/CollaboratorDB/CollaboratorDB-schemas),
and is intended to enable a smooth exchange of data and results between gRED scientists and their external collaborators.
Functionality is provided to read objects from the backend, to save objects in new projects, and to save new versions of existing projects.
Installation currently requires manual handling of a number of dependencies from GitHub;
this should hopefully be simplified once those same packages are accepted into Bioconductor.
```r
BiocManager::install("ArtifactDB/alabaster.base")
BiocManager::install("ArtifactDB/alabaster.matrix")
BiocManager::install("ArtifactDB/alabaster.ranges")
BiocManager::install("ArtifactDB/alabaster.se")
BiocManager::install("ArtifactDB/alabaster.sce")
BiocManager::install("ArtifactDB/alabaster.spatial")
BiocManager::install("ArtifactDB/alabaster.string")
BiocManager::install("ArtifactDB/alabaster.vcf")
BiocManager::install("ArtifactDB/alabaster.bumpy")
BiocManager::install("ArtifactDB/alabaster.mae")
BiocManager::install("ArtifactDB/zircon-R")
BiocManager::install("CollaboratorDB/CollaboratorDB-R")
```
# Listing versions and objects
Given a project name, we can list the available objects across all of its versions:
```{r}
library(CollaboratorDB)
listing <- listObjects("dssc-test_basic-2023")
names(listing) # all available versions
listing
```
The `id` field contains the identifier for each (non-child) object in this project.
```{r}
listing[[1]]$id
```
Other fields may contain useful metadata that was added by the author of the object.
```{r}
listing[[1]]$description
as.list(listing[[1]]$origin)
```
If we know the specific version of interest, we can just list objects from that version:
```{r}
listObjects("dssc-test_basic-2023", version="2023-01-19")
```
# Fetching an object
The `fetchObject()` function will load an R object from the **CollaboratorDB** backend, given the object's identifier:
```{r}
(id <- exampleID())
obj <- fetchObject(id)
obj
```
We can extract the metadata using the `objectAnnotation()` function:
```{r}
str(objectAnnotation(obj))
```
More complex objects can be loaded if the corresponding [**alabaster**](https://github.com/ArtifactDB/alabaster.base) packages are installed.
For example, we can load [`SingleCellExperiment`](https://bioconductor.org/packages/SingleCellExperiment) objects if `r Githubpkg("ArtifactDB/alabaster.sce")` is installed.
```{r}
fetchObject("dssc-test_basic-2023:my_first_sce@2023-01-19")
```
# Fetching multiple objects
We can grab all objects from a particular version of a project with the `fetchAllObjects()` function.
This loops through all the non-child resources in the project and pulls them into the R session.
```{r}
objects <- fetchAllObjects("dssc-test_basic-2023", version="2023-01-19")
objects
```
We can also fetch all objects for all versions of the project, if we don't know the right version ahead of time.
```{r}
versions <- fetchAllObjects("dssc-test_basic-2023")
names(versions) # version names
names(versions[[1]]) # object paths
```
However, these functions are relatively inefficient as they need to load all resources from file.
Prefer using `fetchObject()` explicitly in your scripts once the resource of interest is identified.
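For example, once the listing above has identified a resource of interest, its identifier can be passed straight to `fetchObject()`; this is a minimal sketch that assumes the `listing` object from earlier is still in the session:
```r
# Take the identifier of the first non-child object reported by listObjects(),
# then fetch only that resource instead of looping over the whole project.
target <- listing[[1]]$id[1]
obj.of.interest <- fetchObject(target)
```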
# Saving objects
Given some Bioconductor objects, we can annotate them with relevant metadata.
Most of these fields should be self-explanatory; the least familiar is probably `terms`, an optional list of terms from supported ontologies that makes programmatic annotation easier.
```{r}
library(S4Vectors)
df1 <- DataFrame(A=runif(10), B=rnorm(10), row.names=LETTERS[1:10])
df1 <- annotateObject(df1,
title="FOO",
description="Ich bin ein data frame",
authors="Aaron Lun <[email protected]>",
species=9606,
genome=list(list(id="hg38", source="UCSC")),
origin=list(list(source="PubMed", id="123456789")),
terms=list(list(id="EFO:0008896", source="Experimental Factor Ontology", version="v3.39.1"))
)
```
Then we save the object into a "staging directory" using `saveObject()`.
It's worth noting that only the object passed to `saveObject()` needs to be annotated with `annotateObject()`.
Child objects (e.g., nested `DataFrame`s in a `SummarizedExperiment`) are assumed to be described by the metadata of their parents,
though diligent uploaders are free to annotate the children if further detail needs to be added.
```{r}
staging <- tempfile()
dir.create(staging)
saveObject(df1, staging, "df001")
list.files(staging, recursive=TRUE)
```
Any name can be used for the objects, and multiple objects can be saved into the same directory.
Objects can even be saved into subdirectories:
```{r}
df2 <- DataFrame(A=runif(10), B=rnorm(10), row.names=LETTERS[1:10])
df2 <- annotateObject(df2,
title="BAR",
description="Je suis une data frame",
authors=list(list(name="Darth Vader", email="[email protected]", orcid="0000-0000-0000-0001")),
species=10090,
genome=list(list(id="GRCm38", source="Ensembl")),
origin=list(list(source="GEO", id="GSE123456"))
)
dir.create(file.path(staging, "variants"))
saveObject(df2, staging, "variants/df002")
list.files(staging, recursive=TRUE)
```
Once we're done with staging, we're ready to upload.
We pick a project name with the format `<GROUP>-<TAG>-<YEAR>` (a quick format check is sketched after this list):
- `GROUP` is the name of your group (e.g., `dssc`, `omnibx`, `oncbx`, `dsi`)
- `TAG` is some short string describing your project, using only alphanumeric characters and underscores
- `YEAR` is the current year
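As a rough sanity check, a candidate name can be matched against this pattern before uploading; the regular expression below is only an illustration and not part of the package:
```r
# A hypothetical format check: lower-case group, alphanumeric/underscore tag,
# and a four-digit year, separated by hyphens.
project <- "dssc-test_vignette-2023"
stopifnot(grepl("^[a-z]+-[A-Za-z0-9_]+-[0-9]{4}$", project))
```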
Then we call the `uploadDirectory()` function.
This will prompt us for a [GitHub personal access token](https://github.com/settings/tokens) to authenticate into the backend, if we haven't supplied one already.
🚨🚨🚨 **ALERT!**
To upload new projects, you must either be connected to the Roche corporate network or belong to the [CollaboratorDB](https://github.com/CollaboratorDB) GitHub organization.
🚨🚨🚨
```{r, eval=FALSE}
# Setting an expiry date of 1 day in the future, to avoid having lots of
# testing projects lying around in the data store.
uploadDirectory(staging, project="dssc-test_vignette-2023", expires=1)
```
By default, the current date is used as the version string, but users can specify another versioning scheme if appropriate.
```{r, eval=FALSE}
# Alternative version:
uploadDirectory(staging, project="dssc-test_vignette-2023", version="v1", expires=1)
```
# Updating a project
The same `uploadDirectory()` call can be used to update an existing project by simply specifying another version:
```{r, eval=FALSE}
uploadDirectory(staging, project="dssc-test_vignette-2023", version="v2", expires=1)
```
In practice, our updates are performed long after the original staging directory has been deleted.
We can use the `cloneDirectory()` function to regenerate the staging directory for a previous version;
the contents of this directory can then be modified as desired prior to the `uploadDirectory()` call.
Customizations should be limited to removal of existing resources and addition of new resources.
(Renaming will not work as paths are hard-coded into the JSON files.)
```{r}
new.staging <- tempfile()
cloneDirectory(new.staging, project="dssc-test_basic-2023", version="2023-01-19")
# Applying some customizations.
unlink(file.path(new.staging, "variants"), recursive=TRUE)
saveObject(df2, new.staging, "superfoobar")
# And then we can upload.
# uploadDirectory(new.staging, project="dssc-test_basic-2023", version="20XX-XX-XX")
```
That said, there are no restrictions on what constitutes a new version of a project.
There is no obligation for a new version's resources to overlap with those of a previous version (though the backend can more efficiently organize data if there is some overlap).
If warranted, users can completely change the objects within the project by creating an entirely new staging directory and uploading that as a new version.
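For instance, a sketch of such a wholesale replacement might look like the following, assuming `df1` from earlier is the only object we want in the next version (the version string here is a placeholder):
```r
fresh <- tempfile()
dir.create(fresh)

# Stage a completely different set of objects for the new version.
saveObject(df1, fresh, "replacement_df")

# And upload it as another version of the same project.
# uploadDirectory(fresh, project="dssc-test_vignette-2023", version="v3", expires=1)
```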
# `DelayedArray` wrappers
For arrays, `r self` offers some special behavior during loading:
```{r}
mat <- fetchObject("dssc-test_basic-2023:my_first_sce/assay-1/matrix.h5@2023-01-19")
mat
```
The `CollaboratorDBArray` object is a [`DelayedArray`](https://bioconductor.org/packages/DelayedArray) subclass that remembers its ArtifactDB identifier.
This provides some optimization opportunities during the save/upload of the unmodified array,
and allows for cheap project updates, e.g., when editing the metadata or annotations of a `SummarizedExperiment` without touching the assay data.
The `CollaboratorDBArray` is built on file-backed arrays from the `r Biocpkg("HDF5Array")` package.
This ensures that it can be easily manipulated in an R session while maintaining a small memory footprint.
However, for analysis steps that actually use the array contents, it is best to convert the `CollaboratorDBArray` into an in-memory representation to avoid repeated disk queries:
```{r}
smat <- as(mat, "dgCMatrix")
str(smat)
```
# Searching for objects
🚧🚧🚧 **Coming soon** 🚧🚧🚧
# Advanced usage
The **CollaboratorDB** API is just another ArtifactDB instance, so all methods in the `r Githubpkg("ArtifactDB/zircon-R", "zircon")` package can be used.
For example, we can directly fetch the metadata for individual components:
```{r}
library(zircon)
meta <- getFileMetadata(exampleID(), url=restURL())
str(meta$data_frame)
```
We can inspect the permissions for a project:
```{r}
getPermissions("dssc-test_basic-2023", url=restURL())
```
And we can pull down all metadata for a particular version of a project:
```{r}
v1.meta <- getProjectMetadata("dssc-test_tenx-2023", version="2023-01-19", url=restURL())
length(v1.meta)
```
# Session information {-}
```{r}
sessionInfo()
```