tidy-intro-talk.qmd

---
title: "Tidy Intro Talk"
author: "Michael Love"
format: html
highlight-style: oblivion
editor: visual
---

## Tidyomics project

An open, open-source project spanning multiple R packages, and
developers from around the world. Organized as a GitHub organization
with GitHub Projects. For more:

-   <https://github.com/tidyomics>

-   <https://www.biorxiv.org/content/10.1101/2023.09.10.557072v2>

-   `tidiness_in_bioc` channel in Bioconductor Slack

## Diagram of tidyomics workflows

![](figures/figure2.png){fig-alt="Diagram of how packages share a similar grammar to operate on data objects. From top to bottom, the data are analyzed, manipulated, and made into plots."}

## International development team

![](figures/tidyomics_community.png){fig-alt="Diagram of tidyomics community, with users and developers interacting. On the top are users with arrows coming from developers and packages. On the bottom is the extended community, including Bioconductor."}

## Objects keep data organized

-   We typically have more information than just a matrix

-   Row and column information (on features and samples)

-   Metadata on organism, genome build, annotation release, etc.

-   Keeping this information altogether is the motivation for *data
    objects*

-   Many functions in Bioconductor are *endomorphic* meaning that an
    object is passed in, some data may be added/modified, then the
    object is passed back out

I will first introduce one of the main data objects in Bioconductor,
the *SummarizedExperiment*.

After introducing this, I'll motivate `tidyomics` as applied to
Bioconductor objects.

A *SummarizedExperiment* is built from three tables:

```{r}
#| message: false
library(SummarizedExperiment)
# metadata about genes
genes <- DataFrame(
  id = c("g1","g2","g3","g4"), 
  symbol = c("ABC","DEF","GHI","JKL")
  )
genes
```

```{r}
# metadata about samples
samples <- DataFrame(
  sample = c("s1","s2","s3","s4"),
  condition = c("x","y","x","y"),
  sex = c("m","m","f","f")
  )
samples
```

```{r}
set.seed(123) # some random count data
counts <- matrix(rpois(16, lambda=100), ncol=4)
rownames(counts) <- genes$id
colnames(counts) <- samples$sample
counts
```

```{r}
se <- SummarizedExperiment(
  assays = list(counts = counts),
  rowData = genes, 
  colData = samples,
  metadata = list(organism="Hsapiens")
  )
se
```

## Avoid common bookkeeping errors

Reordering samples (columns) is a global operation:

```{r}
se2 <- se[,c(1,3,2,4)]
assay(se2, "counts")
colData(se2)
```

Assignment that might result in sample swap results in an error:

```{r}
#| eval: false
assay(se2) <- counts
# Error:
# please use 'assay(x, withDimnames=FALSE)) <- value' or 
# 'assays(x, withDimnames=FALSE)) <- value'
# when the rownames or colnames of the supplied assay(s) are not 
# identical to those of the receiving  SummarizedExperiment object 'x'
```

Other such validity checks include comparison of genomic ranges across
different genome builds: will result in an error.

## Can be hard for new users

```{r}
slotNames(se)
methods(class = class(se))
```

## Many beginners know `dplyr`/`ggplot2`

```{r}
#| message: false
library(dplyr)
# filter to samples in condition 'x'
samples |> 
  as_tibble() |> 
  filter(condition == "x")
```

## Enabling dplyr verbs for omics

```{r}
#| message: false
library(tidySummarizedExperiment)
se
```

What does this mean "*SE-tibble abstraction*"?

Essentially this is an API, we can use our familiar verbs and interact
with the native object.

![](figures/counter.png){fig-alt="Picture of a counter with a menu and a bell"}

## Still a standard Bioc object

```{r}
class(se)
dim(se)
```

## We can use familiar dplyr verbs

```{r}
se |> 
  filter(condition == "x")
```

```{r}
se_sub <- se |> 
  filter(condition == "x")
colData(se_sub)
```

## Facilitates quick exploration

```{r}
#| message: false
#| echo: false
library(ggplot2)
theme_set(theme_grey(base_size = 16))
```

```{r ggplot}
#| message: false
library(ggplot2)
se |> 
  filter(.feature %in% c("g1","g2")) |> 
  ggplot(aes(condition, counts, color=sex, group=sex)) + 
  geom_point(size=2) + 
  geom_line() +
  facet_wrap(~.feature) +
  ylim(0,200)
```

```{r}
#| echo: false
options(dplyr.summarise.inform=FALSE)
```

```{r grouping-genes}
# suppose we had gene groups
rowData(se)$gene_group <- c(1,1,2,2)
se |> 
  group_by(gene_group, condition, sex) |> 
  summarize(ave_count = mean(counts), sd_count = sd(counts)) |> 
  ggplot(aes(condition, ave_count, 
             ymin=ave_count - 2*sd_count, 
             ymax=ave_count + 2*sd_count,
             color=sex, group=sex)) + 
  geom_pointrange(position = position_dodge(width = .25)) + 
  facet_wrap(~gene_group, labeller = "label_both") +
  ylim(0,200)
```

We can `mutate` assay data, row data, or column data by modifying
existing columns. If we want to add new data to rows or columns, e.g.
row summaries, a fast way to do this is `mutate_features` (or
`mutate_samples` for columns). These convenience functions are added
in the devel branch in Summer 2024, and available with
`install_github`.

```{r}
se %>%
  mutate_features(row_vars = rowVars(assay(.))) |>
  filter(row_vars > 100)
```

## Also works with Seurat and SCE

*SingleCellExperiment* = *SummarizedExperiment* with extra bells and
whistles for single cells. E.g. slots for reduced dimensions.

```{r umap}
#| message: false
library(tidySingleCellExperiment)
library(scales)
# data from tidyomics/tidyomicsWorkshopBioc2023
sce <- readRDS("data/tidyomicsWorkshopSCE.rds")
# SCE is slightly different than SE, more cell focused
sce |>
  filter(Phase == "G1") |>
  ggplot(aes(UMAP_1, UMAP_2, color=nCount_RNA)) +
  geom_point() + 
  scale_color_gradient2(trans="log10")
```

```{r}
# can include and compute on gene expression values
sce |>
  join_features(c("CD3D","TRDC"), shape="wide") |>
  select(.cell, CD3D, TRDC)
```

```{r}
# process the sample ID from the filename
sce <- sce |>
  extract(file, "sample", "../data/.*/([a-zA-Z0-9_-]+)/outs.+")
sce |>
  select(sample)
# aggregate across sample = pseudobulking, returns SE
sce |>
  aggregate_cells(sample)
```

## Genomic overlap as a `join`

```{r}
#| echo: false
#| message: false
library(plyranges)
set.seed(5)
n <- 40
x <- data.frame(seqnames=1, start=round(runif(n, 101, 996)), 
                width=2, score=rnorm(n, mean=5)) |>
  as_granges() |>
  sort()
seqlengths(x) <- 1000
y <- data.frame(seqnames=1, start=c(101, 451, 801), 
                width=200, id = c("a","b","c")) |>
  as_granges()
```

```{r}
library(plyranges)
x
y
```

![](figures/woodjoin.png){fig-alt="Picture of two stacks of wood being interleaved"}

```{r}
x |> join_overlap_inner(y)
```

Many options, `directed`, `within`, `maxgap`, `minoverlap`, etc.

```{r}
# chaining operations
x |>
  filter(score > 3.5) |>
  join_overlap_inner(y) |>
  group_by(id) |>
  summarize(ave_score = mean(score), n = n())
```

```{r ranges-plot}
# pipe to plot
x |>
  filter(score > 3.5) |>
  join_overlap_inner(y) |>
  as_tibble() |>
  ggplot(aes(x = id, y = score)) + 
  geom_violin() + geom_jitter(width=.1)
```

```{r}
# many convenience functions
y |> 
  anchor_5p() |> # 5', 3', start, end, center
  mutate(width=2) |>
  join_nearest(x, distance=TRUE)
```

## `nullranges`

We developed a package
[nullranges](https://nullranges.github.io/nullranges), as a modular
tool to assist with making genomic comparisons. It doesn't do
enrichment analysis but provides null genomic range sets for
investigating various hypotheses.

## Bootstrapping ranges

Statistical papers from the ENCODE project noted that *block
bootstrapping* genomic data preserves important spatial patterns
(Bickel *et al.* 2010).

![](figures/boot.png){fig-alt="Diagram of block bootstrapping genomic ranges. Blocks are resampled from original data and arranged to form new range sets."}

```{r}
#| message: false
library(nullranges)
boot <- bootRanges(x, blockLength=10, R=20)
# keep track of bootstrap iteration, gives boot dist'n
boot |>
  join_overlap_inner(y) |>
  group_by(iter, id) |>
  summarize(n_overlaps = n())
```

```{r boot-plot}
boot |>
  join_overlap_inner(y) |>
  group_by(iter, id) |>
  summarize(n_overlaps = n()) |>
  as_tibble() |>
  ggplot(aes(x = id, y = n_overlaps)) + 
  geom_boxplot()
```

## Matching ranges

Matching on covariates from a large pool allows for more focused
hypothesis testing.

![](figures/match.png){fig-alt="Diagram of matching genomic ranges. A pool of different colored ranges are drawn from to match the warmer colors of a focal set of ranges."}

```{r match-covar-plot}
xprime <- x |>
  filter(score > 5) |>
  mutate(score = rnorm(n(), mean = score, sd = .5))
m <- matchRanges(focal = xprime, pool = x, covar = ~score)
plotCovariate(m)
```

```{r}
combined <- bind_ranges(
  focal = xprime,
  matching = matched(m),
  pool = x,
  .id = "origin"
)
combined
# now use the different sets for computation:
combined |> 
  join_overlap_inner(y) |>
  group_by(id, origin) |>
  summarize(ave_score = mean(score))
```

```{r match-overlap-plot}
combined |> 
  join_overlap_inner(y) |>
  group_by(id, origin) |>
  summarize(ave_score = mean(score), sd = sd(score)) |>
  as_tibble() |>
  ggplot(aes(origin, ave_score, 
             ymin=ave_score-2*sd, ymax=ave_score+2*sd)) + 
  geom_pointrange() +
  facet_wrap(~id, labeller = label_both)
```

These are published as Application Notes:

-   [bootRanges](https://doi.org/10.1093/bioinformatics/btad190)

-   [matchRanges](https://doi.org/10.1093/bioinformatics/btad197)

## What has been implemented

-   Matrix-shaped objects (SE, SCE)

-   Ranges

-   Interactions

-   Cytometry

-   Spatial

-   more to come...

## Limitations

-   package code and *non-standard evaluation*
-   optimized code, e.g. matrix operations

## Outro

Recommend genomic data analysts are always checking:

-   main contributions to variance (e.g. PCA, see `plotPCA` for bulk
    and [OSCA](https://bioconductor.org/books/release/OSCA/) for sc)
-   column and row densities (`tidySE` allows directly plotting
    `geom_density` of rows/columns, or `geom_violin`)
-   known positive features, feature-level plots (`filter` to feature,
    pipe to `geom_point` etc.)

If you're interested in more complicated use cases of `tidyomics` see
this online book:

-   [Tidy ranges
    tutorial](https://tidyomics.github.io/tidy-ranges-tutorial)

## Contributors

-   Stefano Mangiola (*tidyomics* leadership, tidy expression, single
    cell, spatial)

-   Eric Davis, Wancen Mu, Doug Phanstiel (*nullranges*)

-   Stuart Lee, Michael Lawrence, Di Cook (*plyranges*)

And **tidyomics developers**: William Hutchison, Timothy Keyes, Helena
Crowell, Jacques Serizay, Charlotte Soneson, Eric Davis, Noriaki Sato,
Lambda Moses, Boyd Tarlinton, Abdullah Nahid, Miha Kosmac, Quentin
Clayssen, Victor Yuan, Wancen Mu, Ji-Eun Park, Izabela Mamede, Min
Hyung Ryu, Pierre-Paul Axisa, Paulina Paiz, Chi-Lam Poon, Ming Tang

## Funding

*tidyomics* project funded by an Essential Open Source Software award
from CZI and Wellcome Trust