Skip to content

Commit

Permalink
Merge pull request #49 from ccb-hms/ex_updates
Browse files Browse the repository at this point in the history
Ex updates
  • Loading branch information
andrewGhazi authored Oct 3, 2024
2 parents adc26ed + df67604 commit 12a699d
Show file tree
Hide file tree
Showing 2 changed files with 42 additions and 38 deletions.
6 changes: 6 additions & 0 deletions episodes/eda_qc.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -416,6 +416,12 @@ head(hvg.sce.var)

Imagine you have data that were prepared by three people with varying level of experience, which leads to varying technical noise. How can you account for this blocking structure when selecting HVGs?

::: hint

`modelGeneVar()` can take a `block` argument.

:::

::: solution
Use the `block` argument in the call to `modelGeneVar()` like so:

Expand Down
74 changes: 36 additions & 38 deletions episodes/hca.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,9 @@ exercises: 10 # Minutes of exercises in the lesson

::::::::::::::::::::::::::::::::::::::::::::::::


# Single Cell data sources

## HCA Project

The Human Cell Atlas (HCA) is a large project that aims to learn from and map
Expand Down Expand Up @@ -102,15 +105,15 @@ metadata <- get_metadata(remote_url = CuratedAtlasQueryR::SAMPLE_DATABASE_URL) |
collect()
```

Get a view of the first 10 columns in the metadata with `glimpse`
Get a view of the first 10 columns in the metadata with `glimpse()`

```{r}
metadata |>
select(1:10) |>
glimpse()
```

## A note on the pipe operator
## A tangent on the pipe operator

The vignette materials provided by `CuratedAtlasQueryR` show the use of the
'native' R pipe (implemented after R version `4.1.0`). For those not familiar
Expand All @@ -134,49 +137,51 @@ This command is equivalent to the following:
summarise(filter(mtcars, cyl != 4), mean_disp = mean(disp), .by = cyl)
```

## Summarizing the metadata
## Exploring the metadata

Let's examine the metadata to understand what information it contains.

For each distinct tissue and dataset combination, count the number of datasets
by tissue type.
We can tally the tissue types across datasets to see what tissues the experimental data come from:

```{r}
metadata |>
distinct(tissue, dataset_id) |>
count(tissue)
count(tissue) |>
arrange(-n)
```

## Columns available in the metadata
We can do the same for the assay types:

```{r, message = FALSE}
head(names(metadata), 10)
```{r}
metadata |>
distinct(assay, dataset_id) |>
count(assay)
```

:::: challenge

Glance over the full list of metadata column names. Do any other metadata columns jump out as interesting to you for your work?
Look through the full list of metadata column names. Do any other metadata
columns jump out as interesting to you for your work?

```{r eval=FALSE}
metadata |> names() |> sort()
names(metadata)
```

::::

## Available assays

```{r}
metadata |>
distinct(assay, dataset_id) |>
count(assay)
```

### Download single-cell RNA sequencing counts
## Downloading single cell data

The data can be provided as either "counts" or counts per million "cpm" as given
by the `assays` argument in the `get_single_cell_experiment()` function. By
default, the `SingleCellExperiment` provided will contain only the 'counts'
data.

For the sake of demonstration, we'll focus this small subset of samples:
For the sake of demonstration, we'll focus this small subset of samples. We use the `filter()` function from the `dplyr` package to identify cells meeting the following criteria:

* African ethnicity
* 10x assay
* lung parenchyma tissue
* CD4 cells

```{r}
sample_subset <- metadata |>
Expand All @@ -188,8 +193,9 @@ sample_subset <- metadata |>
)
```

Out of the `r nrow(metadata)` cells in the sample database, `r nrow(sample_subset)` cells meet this criteria.

#### Query raw counts
Now we can use `get_single_cell_experiment()`:

```{r, message = FALSE}
single_cell_counts <- sample_subset |>
Expand All @@ -198,17 +204,14 @@ single_cell_counts <- sample_subset |>
single_cell_counts
```

#### Query counts scaled per million

This is helpful if just few genes are of interest, as they can be compared
across samples.
You can provide different arguments to `get_single_cell_experiment()` to get different formats or subsets of the data, like data scaled to counts per million:

```{r, message = FALSE}
sample_subset |>
get_single_cell_experiment(assays = "cpm")
```

#### Extract only a subset of genes
or data on only specific genes:

```{r, message = FALSE}
single_cell_counts <- sample_subset |>
Expand All @@ -217,11 +220,9 @@ single_cell_counts <- sample_subset |>
single_cell_counts
```

#### Extracting counts as a Seurat object

If needed, the H5 `SingleCellExperiment` can be converted into a Seurat object.
Note that it may take a long time and use a lot of memory depending on how many
cells you are requesting.
Or if needed, the H5 `SingleCellExperiment` can be returned a Seurat
object (note that this may take a long time and use a lot of memory depending on
how many cells you are requesting).

```{r,eval=FALSE}
single_cell_counts <- sample_subset |>
Expand All @@ -230,13 +231,10 @@ single_cell_counts <- sample_subset |>
single_cell_counts
```

### Save your `SingleCellExperiment`

#### Saving as HDF5
## Save your `SingleCellExperiment`

The recommended way of saving these `SingleCellExperiment` objects, if
necessary, is to use `saveHDF5SummarizedExperiment` from the `HDF5Array`
package.
Once you have a dataset you're happy with, you'll probably want to save it. The recommended way of saving these `SingleCellExperiment` objects is to use
`saveHDF5SummarizedExperiment` from the `HDF5Array` package.

```{r, eval=FALSE}
single_cell_counts |> saveHDF5SummarizedExperiment("single_cell_counts")
Expand Down

0 comments on commit 12a699d

Please sign in to comment.