Merge pull request #108 from dvelazq/main
HW3 Submission
JEFworks authored Feb 6, 2024
2 parents fe1539f + 4213fdd commit 2b03fde
Showing 2 changed files with 147 additions and 0 deletions.
147 changes: 147 additions & 0 deletions _posts/2024-02-05-dvelazq5.md
@@ -0,0 +1,147 @@
---
layout: post
title: "The Impact of Normalization on Dimensionality Reduction: IGKC Expression Level Case Study"
author: Dee Velazquez
jhed: dvelazq5
categories: [ HW3 ]
image: homework/hw3/hw3_dvelazq5.png
featured: false
---

### What data types are you visualizing?
IGKC gene expression count: quantitative
PC 1 and 2 scores: quantitative
t-SNE coordinates emb1 and emb2: quantitative

I am visualizing the expression of the IGKC gene in PC space and in t-SNE space, and comparing how normalizing versus not normalizing
the gene expression data affects each dimensionality reduction (PCA and t-SNE).

### What data encodings (geometric primitives and visual channels) are you using to visualize these data types?
Geometric Primitives: Points

Visual Channels: Color hues

Each point represents a spot, positioned by its PC scores (linear combinations of the top 100 most highly expressed genes) or by its t-SNE coordinates.
The plasma color scale indicates the expression level of IGKC: lighter colors mean lower IGKC expression, and darker colors mean higher IGKC expression.

### What about the data are you trying to make salient through this data visualization?
I am trying to illustrate the effect that normalizing gene expression data has on dimensionality reduction, especially on the directions of maximum variance
and on the positions of spots in the reduced-dimensional space, and how these changes can affect the way we interpret the relationships between points.

### What Gestalt principles or knowledge about the perceptiveness of visual encodings are you using to accomplish this?
I used the Gestalt principles of similarity and proximity. Points of the same color are perceived as being part of the same group, in this case
conveying the similarity of the spots' IGKC expression levels. Clusters of similarly colored points that sit close together can also be perceived as one group,
and ideally they have similar gene expression profiles and correspond to similar cell types.

### What happens if I do or do not normalize/transform the gene expression data (e.g., log and/or scale) prior to dimensionality reduction?
Without normalizing the gene expression data prior to PCA, the gene with the highest loading on PC1 was IGKC and the gene with the highest loading
on PC2 was COL1A1. With normalization, the gene with the highest loading on PC1 was IGHA1, and the gene with the highest loading on PC2
was CPB1. This is visually reflected in subplots (a) and (c), which differ noticeably from one another: the spots are spread differently along the PCs, and subplot (c) has more dark points
than (a), indicating greater IGKC expression. Note that (a) is colored by log10 of the raw counts and (c) by log10 of the normalized counts, so the two color scales are not directly comparable. Overall, this illustrates the importance of normalization, which may describe our gene expression data more accurately.
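
A small simulation can show why the top-loading gene tends to change once the data are transformed. The sketch below uses made-up counts, not the homework data; the gene names `geneA`, `geneB`, and `geneC` are hypothetical, and it applies only a log10 transform rather than the full library-size normalization. It also ranks genes by the absolute value of their loading, since the sign of a principal component is arbitrary and sorting signed loadings can miss a gene with a large negative loading.

```{r}
# Simulated counts for hypothetical genes (not the homework data)
set.seed(1)
n_spots <- 200
counts <- data.frame(
  geneA = rpois(n_spots, lambda = 500),  # highly expressed, so large variance
  geneB = rpois(n_spots, lambda = 5),
  geneC = rpois(n_spots, lambda = 5)
)

# PCA on raw counts (prcomp centers the data but does not scale it by default)
pca_raw <- prcomp(counts)
# PCA on log10-transformed counts with a pseudocount of 1
pca_log <- prcomp(log10(counts + 1))

# Rank genes by absolute PC1 loading, because the sign of a PC is arbitrary
top_raw <- names(which.max(abs(pca_raw$rotation[, 1])))
top_log <- names(which.max(abs(pca_log$rotation[, 1])))

top_raw  # the high-count gene dominates PC1 on raw counts
top_log  # after the log transform, a low-count gene takes over
```

On raw counts, PC1 mostly tracks the gene with the largest variance, which here is simply the most highly expressed one. After the log transform that gene's variance shrinks, so a different gene ends up with the largest PC1 loading.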

The t-SNE subplots (b) and (d) show the same effect. Without normalizing the gene expression data prior to t-SNE, the spots land in
very different positions in the reduced-dimensional space. Subplot (d) also shows higher IGKC expression than (b),
which may better represent true IGKC expression in the spots and point to similar cell types.

I think the main takeaway is the importance of normalizing data before performing PCA or t-SNE. Without normalization, features with larger variances may skew the PCA results.
t-SNE uses the distances between points to compute scaled similarities. Without normalization, those distances may not accurately represent the true similarities between spots, and can potentially
display misleading information about our gene expression data.
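
The effect of normalization on distances can be illustrated with a toy example. The two spots below are made up (the names `spot1`, `spot2`, and `geneA` to `geneC` are hypothetical): they have identical gene proportions, but `spot2` has ten times the total counts. The normalization follows the same steps as the homework code: divide each spot's counts by its total, rescale by the mean total, then take log10 with a pseudocount of 1. This is a minimal sketch under those assumptions, not an analysis of the real data.

```{r}
# Toy counts: hypothetical spots with the same composition but different depth
toy <- rbind(
  spot1 = c(geneA = 10,  geneB = 30,  geneC = 60),
  spot2 = c(geneA = 100, geneB = 300, geneC = 600)
)

# Euclidean distance on raw counts reflects total counts, not composition
dist_raw <- as.numeric(dist(toy))

# Scale each spot to the mean library size, then log10 with a pseudocount of 1
lib_size <- rowSums(toy)
toy_norm <- log10(toy / lib_size * mean(lib_size) + 1)
dist_norm <- as.numeric(dist(toy_norm))

dist_raw   # large: driven by the difference in total counts
dist_norm  # 0: the two spots have identical composition
```

Because t-SNE starts from pairwise distances like these, two spots with the same expression profile can look far apart on raw counts purely because one has more total counts.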

```{r}
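# Dee Velazquez
# HW 3
library(ggplot2)    # plotting
library(Rtsne)      # t-SNE
library(patchwork)  # combining plots with `+`, plot_annotation(), plot_layout()

# Get data
data <- read.csv('eevee.csv.gz', row.names = 1)
# Get genes (columns 4 onward)
gexp <- data[, 4:ncol(data)]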
# Limit number of genes by getting top 100 genes based on mean expression level
# (e.g., COL1A1, IGKC)
top_genes <- names(sort(colMeans(gexp), decreasing = TRUE)[1:100])
filtered_genes <- gexp[, top_genes]
dim(filtered_genes)

# BIG QUESTION:
# Want to know...what happens if I do not normalize gene expression data
# prior to dimensionality reduction?

# Log10 transformation with addition of 1 (no library-size normalization)
log_filtered_genes <- log10(filtered_genes + 1)
head(log_filtered_genes)

# Library-size normalization, then log10 transformation with addition of 1
gexp_norm <- log10(filtered_genes / rowSums(filtered_genes) *
                     mean(rowSums(filtered_genes)) + 1)
dim(gexp_norm)
head(gexp_norm)

## Dimensionality reduction w/o normalizing gene expression beforehand
# PCA
pca <- prcomp(filtered_genes)
dim(pca$rotation)
# IGKC is the gene with the highest loading value for PC1
head(sort(pca$rotation[, 1], decreasing = TRUE))
# COL1A1 is the gene with the highest loading value for PC2
head(sort(pca$rotation[, 2], decreasing = TRUE))
head(pca$x[, 1:5])
head(pca$rotation[, 1:5])
df <- data.frame(pca$x, filtered_genes)
df$IGKC_log10 <- log10(df$IGKC + 1)
#p1 <- ggplot(df) + geom_point(aes(x = PC1, y= PC2, col=IGKC))
#+ theme_minimal()
p1 <- ggplot(df) +
  geom_point(aes(x = PC1, y = PC2, col = IGKC_log10), alpha = 0.7) +
  scale_color_viridis_c(option = "C", name = "log10(IGKC Expression)", direction = -1) +
  labs(title = "PCA of the IGKC Gene At Each Spot Without Normalization",
       x = "PC1",
       y = "PC2") +
  theme_minimal()
p1

# tSNE
set.seed(0)  # t-SNE is stochastic; the seed value is an arbitrary choice for reproducibility
emb <- Rtsne(filtered_genes, dims = 2)
df2 <- data.frame(emb = emb$Y, filtered_genes)
df2$IGKC_log10 <- log10(df2$IGKC + 1)
p2 <- ggplot(df2) +
  geom_point(aes(x = emb.1, y = emb.2, col = IGKC_log10), alpha = 0.7) +
  scale_color_viridis_c(option = "C", name = "log10(IGKC Expression)", direction = -1) +
  labs(title = "t-SNE of the IGKC Gene At Each Spot Without Normalization",
       x = "EMB1",
       y = "EMB2") +
  theme_minimal()
p2

## Dimensionality reduction done after normalizing gene expression
# PCA
#gexp_norm
#pca2 <- prcomp(log_filtered_genes)
pca2 <- prcomp(gexp_norm)
head(sort(pca2$rotation[,1], decreasing=TRUE))
#now...IGHA1 is the highest LF for PC1 instead of IGKC
head(sort(pca2$rotation[,2], decreasing=TRUE))
#now...CPB1 is the highest LF for PC2 instead of COL1A1
head(pca2$x[,1:5])
head(pca2$rotation[,1:5])
#df3 <- data.frame(pca2$x, log_filtered_genes)
df3 <- data.frame(pca2$x, gexp_norm)
# IGKC in df3 is already normalized and log10-transformed
p3 <- ggplot(df3) +
  geom_point(aes(x = PC1, y = PC2, col = IGKC), alpha = 0.7) +
  scale_color_viridis_c(option = "C", name = "log10(Normalized IGKC Expression)", direction = -1) +
  labs(title = "PCA of the IGKC Gene At Each Spot With Normalization",
       x = "PC1",
       y = "PC2") +
  theme_minimal()
p3

# tSNE
#emb2 <- Rtsne(log_filtered_genes, dims = 2)
set.seed(0)  # t-SNE is stochastic; the seed value is an arbitrary choice for reproducibility
emb2 <- Rtsne(gexp_norm, dims = 2)
#df4 <- data.frame(emb=emb2$Y,log_filtered_genes)
df4 <- data.frame(emb = emb2$Y, gexp_norm)
# IGKC in df4 is already normalized and log10-transformed
p4 <- ggplot(df4) +
  geom_point(aes(x = emb.1, y = emb.2, col = IGKC), alpha = 0.7) +
  scale_color_viridis_c(option = "C", name = "log10(Normalized IGKC Expression)", direction = -1) +
  labs(title = "t-SNE of the IGKC Gene At Each Spot With Normalization",
       x = "EMB1",
       y = "EMB2") +
  theme_minimal()
p4
# Plotting
p1 + p3 + plot_annotation(tag_levels = 'a') + plot_layout(ncol = 1)
p2 + p4 + plot_annotation(tag_levels = 'a') + plot_layout(ncol = 1)
final_plot <- p1 + p2 + p3 + p4 + plot_annotation(tag_levels = 'a') + plot_layout(ncol = 2)
final_plot
```
Binary file added homework/hw3/hw3_dvelazq5.png