---
title: R client for CollaboratorDB
author:
- name: Aaron Lun
email: [email protected]
package: zircon
date: "Revised: January 19, 2023"
output:
BiocStyle::html_document
vignette: >
%\VignetteIndexEntry{Using the CollaboratorDB R client}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE}
library(BiocStyle)
self <- Githubpkg("CollaboratorDB/CollaboratorDB-R", "CollaboratorDB");
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
```
# Introduction
`r self` implements a simple R client for interacting with the **CollaboratorDB** API.
**CollaboratorDB** provides a publicly accessible store for Bioconductor objects based on the [schemas here](https://github.com/CollaboratorDB/CollaboratorDB-schemas),
and is intended to enable a smooth exchange of data and results between gRED scientists and their external collaborators.
Functionality is provided to read objects from the backend, to save objects in new projects, and to save new versions of existing projects.
Installation currently requires manual handling of a number of dependencies from GitHub;
this should hopefully be simplified once those same packages are accepted into Bioconductor.
```r
BiocManager::install("ArtifactDB/alabaster.base")
BiocManager::install("ArtifactDB/alabaster.matrix")
BiocManager::install("ArtifactDB/alabaster.ranges")
BiocManager::install("ArtifactDB/alabaster.se")
BiocManager::install("ArtifactDB/alabaster.sce")
BiocManager::install("ArtifactDB/alabaster.spatial")
BiocManager::install("ArtifactDB/alabaster.string")
BiocManager::install("ArtifactDB/alabaster.vcf")
BiocManager::install("ArtifactDB/alabaster.bumpy")
BiocManager::install("ArtifactDB/alabaster.mae")
BiocManager::install("ArtifactDB/zircon-R")
BiocManager::install("CollaboratorDB/CollaboratorDB-R")
```
# Listing versions and objects
Given a project name, we can list the available objects across all of its versions:
```{r}
library(CollaboratorDB)
listing <- listObjects("dssc-test_basic-2023")
names(listing) # all available versions
listing
```
The `id` field contains the identifier for each (non-child) object in this project.
```{r}
listing[[1]]$id
```
Other fields may contain useful metadata that was added by the author of the object.
```{r}
listing[[1]]$description
as.list(listing[[1]]$origin)
```
If we know the specific version of interest, we can just list objects from that version:
```{r}
listObjects("dssc-test_basic-2023", version="2023-01-19")
```
# Fetching an object
The `fetchObject()` function will load an R object from the **CollaboratorDB** backend, given the object's identifier:
```{r}
(id <- exampleID())
obj <- fetchObject(id)
obj
```
We can extract the metadata using the `objectAnnotation()` function:
```{r}
str(objectAnnotation(obj))
```
More complex objects can be loaded if the corresponding [**alabaster**](https://github.com/ArtifactDB/alabaster.base) packages are installed.
For example, we can load [`SingleCellExperiment`](https://bioconductor.org/packages/SingleCellExperiment) objects if `r Githubpkg("ArtifactDB/alabaster.sce")` is installed.
```{r}
fetchObject("dssc-test_basic-2023:my_first_sce@2023-01-19")
```
# Fetching multiple objects
We can grab all objects from a particular version of a project with the `fetchAllObjects()` function.
This loops through all the non-child resources in the project and pulls them into the R session.
```{r}
objects <- fetchAllObjects("dssc-test_basic-2023", version="2023-01-19")
objects
```
We can also fetch all objects for all versions of the project, if we don't know the right version ahead of time.
```{r}
versions <- fetchAllObjects("dssc-test_basic-2023")
names(versions) # version names
names(versions[[1]]) # object paths
```
However, these functions are relatively inefficient as they need to load all resources from file.
Prefer using `fetchObject()` explicitly in your scripts once the resource of interest is identified.
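For example, once the listing above has identified a resource of interest, its identifier can be passed straight to `fetchObject()`; this is a minimal sketch that assumes the `listing` object from earlier is still in the session:
```r
# Take the identifier of the first non-child object reported by listObjects(),
# then fetch only that resource instead of looping over the whole project.
target <- listing[[1]]$id[1]
obj.of.interest <- fetchObject(target)
```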
# Saving objects
Given some Bioconductor objects, we can annotate them with relevant metadata.
Most of these fields should be self-explanatory; the least familiar is probably `terms`, an optional list of terms from supported ontologies that makes programmatic annotation easier.
```{r}
library(S4Vectors)
df1 <- DataFrame(A=runif(10), B=rnorm(10), row.names=LETTERS[1:10])
df1 <- annotateObject(df1,
title="FOO",
description="Ich bin ein data frame",
authors="Aaron Lun <[email protected]>",
species=9606,
genome=list(list(id="hg38", source="UCSC")),
origin=list(list(source="PubMed", id="123456789")),
terms=list(list(id="EFO:0008896", source="Experimental Factor Ontology", version="v3.39.1"))
)
```
Then we save the object into a "staging directory" using `saveObject()`.
It's worth noting that only the object passed to `saveObject()` needs to be annotated with `annotateObject()`.
Child objects (e.g., nested `DataFrame`s in a `SummarizedExperiment`) are assumed to be described by the metadata of their parents,
though diligent uploaders are free to annotate the children if further detail needs to be added.
```{r}
staging <- tempfile()
dir.create(staging)
saveObject(df1, staging, "df001")
list.files(staging, recursive=TRUE)
```
Any name can be used for the objects, and multiple objects can be saved into the same directory.
Objects can even be saved into subdirectories:
```{r}
df2 <- DataFrame(A=runif(10), B=rnorm(10), row.names=LETTERS[1:10])
df2 <- annotateObject(df2,
title="BAR",
description="Je suis une data frame",
authors=list(list(name="Darth Vader", email="[email protected]", orcid="0000-0000-0000-0001")),
species=10090,
genome=list(list(id="GRCm38", source="Ensembl")),
origin=list(list(source="GEO", id="GSE123456"))
)
dir.create(file.path(staging, "variants"))
saveObject(df2, staging, "variants/df002")
list.files(staging, recursive=TRUE)
```
Once we're done with staging, we're ready to upload.
We pick a project name with the format `<GROUP>-<TAG>-<YEAR>` (a quick format check is sketched after this list):
- `GROUP` is the name of your group (e.g., `dssc`, `omnibx`, `oncbx`, `dsi`)
- `TAG` is some short string describing your project, using only alphanumeric characters and underscores
- `YEAR` is the current year
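As a rough sanity check, a candidate name can be matched against this pattern before uploading; the regular expression below is only an illustration and not part of the package:
```r
# A hypothetical format check: lower-case group, alphanumeric/underscore tag,
# and a four-digit year, separated by hyphens.
project <- "dssc-test_vignette-2023"
stopifnot(grepl("^[a-z]+-[A-Za-z0-9_]+-[0-9]{4}$", project))
```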
Then we call the `uploadDirectory()` function.
This will prompt us for a [GitHub personal access token](https://github.com/settings/tokens) to authenticate into the backend, if we haven't supplied one already.
🚨🚨🚨 **ALERT!**
To upload new projects, you must either be connected to the Roche corporate network or belong to the [CollaboratorDB](https://github.com/CollaboratorDB) GitHub organization.
🚨🚨🚨
```{r, eval=FALSE}
# Setting an expiry date of 1 day in the future, to avoid having lots of
# testing projects lying around in the data store.
uploadDirectory(staging, project="dssc-test_vignette-2023", expires=1)
```
By default, the current date is used as the version string, but users can specify another versioning scheme if appropriate.
```{r, eval=FALSE}
# Alternative version:
uploadDirectory(staging, project="dssc-test_vignette-2023", version="v1", expires=1)
```
# Updating a project
The same `uploadDirectory()` call can be used to update an existing project by simply specifying another version:
```{r, eval=FALSE}
uploadDirectory(staging, project="dssc-test_vignette-2023", version="v2", expires=1)
```
In practice, our updates are performed long after the original staging directory has been deleted.
We can use the `cloneDirectory()` function to regenerate the staging directory for a previous version;
the contents of this directory can then be modified as desired prior to the `uploadDirectory()` call.
Customizations should be limited to removal of existing resources and addition of new resources.
(Renaming will not work as paths are hard-coded into the JSON files.)
```{r}
new.staging <- tempfile()
cloneDirectory(new.staging, project="dssc-test_basic-2023", version="2023-01-19")
# Applying some customizations.
unlink(file.path(new.staging, "variants"), recursive=TRUE)
saveObject(df2, new.staging, "superfoobar")
# And then we can upload.
# uploadDirectory(new.staging, project="dssc-test_basic-2023", version="20XX-XX-XX")
```
That said, there are no restrictions on what constitutes a new version of a project.
There is no obligation for a new version's resources to overlap with those of a previous version (though the backend can more efficiently organize data if there is some overlap).
If warranted, users can completely change the objects within the project by creating an entirely new staging directory and uploading that as a new version.
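For instance, a sketch of such a wholesale replacement might look like the following, assuming `df1` from earlier is the only object we want in the next version (the version string here is a placeholder):
```r
fresh <- tempfile()
dir.create(fresh)

# Stage a completely different set of objects for the new version.
saveObject(df1, fresh, "replacement_df")

# And upload it as another version of the same project.
# uploadDirectory(fresh, project="dssc-test_vignette-2023", version="v3", expires=1)
```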
# `DelayedArray` wrappers
For arrays, `r self` offers some special behavior during loading:
```{r}
mat <- fetchObject("dssc-test_basic-2023:my_first_sce/assay-1/matrix.h5@2023-01-19")
mat
```
The `CollaboratorDBArray` object is a [`DelayedArray`](https://bioconductor.org/packages/DelayedArray) subclass that remembers its ArtifactDB identifier.
This provides some optimization opportunities during the save/upload of the unmodified array,
and allows for cheap project updates, e.g., when editing the metadata or annotations of a `SummarizedExperiment` without touching the assay data.
The `CollaboratorDBArray` is built on file-backed arrays from the `r Biocpkg("HDF5Array")` package.
This ensures that it can be easily manipulated in an R session while maintaining a small memory footprint.
However, for analysis steps that actually use the array contents, it is best to convert the `CollaboratorDBArray` into an in-memory representation to avoid repeated disk queries:
```{r}
smat <- as(mat, "dgCMatrix")
str(smat)
```
# Searching for objects
🚧🚧🚧 **Coming soon** 🚧🚧🚧
# Advanced usage
The **CollaboratorDB** API is just another ArtifactDB instance, so all methods in the `r Githubpkg("ArtifactDB/zircon-R", "zircon")` package can be used.
For example, we can directly fetch the metadata for individual components:
```{r}
library(zircon)
meta <- getFileMetadata(exampleID(), url=restURL())
str(meta$data_frame)
```
We can inspect the permissions for a project:
```{r}
getPermissions("dssc-test_basic-2023", url=restURL())
```
And we can pull down all metadata for a particular version of a project:
```{r}
v1.meta <- getProjectMetadata("dssc-test_tenx-2023", version="2023-01-19", url=restURL())
length(v1.meta)
```
# Session information {-}
```{r}
sessionInfo()
```