Cell type exploration using differential gene expression analyses


Archana Balan
I ama a first year Biomedical Engineering PhD Student at Johns Hopkins University. My research interests include developing computational methods for cancer immunogenomics

Cell type exploration using differential gene expression analyses

In the above visualization I have identified a cluster that belong to plasma cells or mature B cells. I started with normalizing the gene expression data by the total gene count. To perform k-means clustering I iteratively computed the total withinness score for different values of k and found 11 clusters to be an optimal number for the pikachu dataset. I then picked cluster 3 for differential gene expression analyses. Panel A shows cluster 3 in the lower dimensional space (tSNE) and panel B shows cluster 3 in the spatial dimension. Further I performed differential gene expression analyses using a Wilcoxon rank sum test, comparing cluster 3 with the rest of the cells. Panel E shows a volaco plot with log2 fold change on the x axis and -log10(pval +1e-300) on the y-axis. The points in red show significantly down regulated genes (log2fc < -2 and p < 0.05) and points in bIue show significantly upregulated genes (log2fc > 2 and p < 0.05) then visualized TENT5C gene which was significantly upregulated in cluster 3.Panel C shows expression of TENT5C in the lower dimensional space (tSNE) and panel D shows TENT5C expression in the spatial dimension.

Identifying cell type of cluster 3 The volcano plot in panel E shows that MZB1, CD79A, TENT5C, DERL3 , TNFRSF17 are significantly upregulated in cells in cluster 3. The upregulation of these genes indicates that cells potentially belong to plasma cells or mature B cells. The human protein atlas shows that these genes are most highly expressed in plasma cells followed by B cells[1]. Plasma cells are formed from activated/mature B cells and are responsible for the first line of immune response. TENT5C gene is highly expressed in plasma cells as well as activated B cells which eventually form plasma cells [2]. DERL3 and MZB1 were reported as gene markers for plasma cells from a study of whole transcriptional analysis of RNA sequencing data [3]. MZB1 is also shown to be highly expressed in marginal zone B or B1 cells[4] TNFRSF17 has also been reported as an important gene marker for plasma cells [5] and is known to be highly expressed in mature B cells [6].CD79A is shown to be expressed though all stages of B cell maturation [7]. Hence the cells in cluster 3 most likely belong to plasma cells or mature B cells.

References: [1] https://www.proteinatlas.org/humanproteome/single+cell+type [2] https://www.nature.com/articles/s41467-020-15835-3 [3] https://translational-medicine.biomedcentral.com/articles/10.1186/s12967-020-02220-3 [4] https://www.nature.com/articles/s41598-020-78293-3 [5] https://www.biocompare.com/Editorial-Articles/585833-A-Guide-to-Plasma-Cell-Markers/ [6] https://www.ncbi.nlm.nih.gov/gene/608 [7] https://pubmed.ncbi.nlm.nih.gov/7632952/

library(dplyr)
library(tidyverse)
library(ggplot2)
library(Rtsne)
library(patchwork)
library(viridis)
library(ggrepel)

set.seed(123)
## Data Description
# Each row is a cell ~17000
# Columns are mainly genes ~300

# Set working directory path and read in data
setwd("/Users/archanabalan/Karchin Lab Dropbox/Archana Balan/Coursework/Spring2024/GDV/homework/")
data <- read.csv("./data/pikachu.csv.gz", row.names =1)

# Drop first 5 columns for gene expression (cell_id, cell_area etc)
gexp = data[, -c(1:5)]
rownames(gexp) <- data$cell_id

# normalize data by total gene expression for each cell
gexpnorm <- gexp/rowSums(gexp)
# check if normalization is correct (value would be 1 for all cells)
unique(rowSums(gexpnorm) )

set.seed(123)
# dimensionality reduction 
tsne= Rtsne(gexpnorm, dims =2)

# k-means clustering

## iteratively compute tot.withinness score for different values of k 
withinnes.list = sapply(c(5:15), function(x) { 
     print(x)
     tmp = kmeans(gexpnorm, centers = x)
    return(tmp$tot.withinss)} )
## visualize data to determine the elbow of the plot
p0 = ggplot(data.frame(k = c(5:15), tot.withinss = withinnes.list), 
            aes(k, tot.withinss)) +
            geom_point()

# k=11 is found to be an optimal number for clustering the data
com = kmeans(gexpnorm, centers = 11)$cluster

# set cluster to analyze further
select.cluster = 3

# differential gene expression of cluster 3 vs all other cells
# performing wilcoxon rank sum test for all genes
w.res = sapply(colnames(gexpnorm), function(x){
  wilcox.test(gexpnorm[com == select.cluster, x],
              gexpnorm[!com == select.cluster, x], 
              alternative = "two.sided")$p.val})

# compute log2 fold change for cluster 3 
logfc = sapply(colnames(gexp), function(i){
  log2(mean(gexpnorm[com ==select.cluster, i])/mean(gexpnorm[com !=select.cluster, i]))})


# Plots

# Visualize cluster 3: tSNE
cluster.cols <- c("purple", "lightgrey" )
names(cluster.cols ) <- c( "Cluster_3", "Other")

plot.df <- data.frame(emb1 = tsne$Y[,1], 
                      emb2 = tsne$Y[,2],
                      cluster = ifelse(com == select.cluster, "Cluster_3", "Other"))

p1 <- ggplot(plot.df, aes(emb1, emb2,col=cluster)) + 
  geom_point(size = 0.4) +
  theme_classic() + 
  scale_color_manual(values = cluster.cols)+
  ggtitle("Cell cluster 3 visualized on tSNE space") 

# Visualize cluster 3: spatial coords
plot.df <- data.frame(aligned_x = data$aligned_x, 
                      aligned_y = data$aligned_y,
                      cluster = ifelse(com == select.cluster, "Cluster_3", "Other"))

p2 <- ggplot(plot.df, aes(aligned_x, aligned_y, col=cluster)) + 
  geom_point(size=0.4) +
  theme_classic() + 
  scale_color_manual(values = cluster.cols)+
  ggtitle("Cell cluster 3 visualized on physical space")



# Volcano plot 
# Ref: https://dplyr.tidyverse.org/reference/case_when.html
plot.df =  data.frame(pv = -log10(w.res + 1e-300), 
                      logfc = logfc, 
                      name = colnames(gexp)) %>% 
                      mutate(gene.reg = case_when( pv >1.30103 &  logfc < -2 ~ "Downregulated", 
                                                   pv >1.30103 &  logfc > 2 ~ "Upregulated",
                                                    .default  = "No_difference"))


gene.cols = c("red", "blue", "grey")
names(gene.cols) = c("Downregulated", "Upregulated", "No_difference")

#Ref: https://ggplot2.tidyverse.org/reference/geom_abline.html
#Ref: https://www.datanovia.com/en/blog/how-to-remove-legend-from-a-ggplot/

p3 <- ggplot(data =plot.df , aes(x = logfc, y = pv, col = gene.reg)) +
          geom_point()+
          geom_vline(xintercept = 2, linetype='dashed', col = 'grey') +
          geom_vline(xintercept = -2, linetype='dashed', col = 'grey') +
          geom_hline(yintercept = 1.30103, linetype='dashed', col = 'grey')+
          geom_label_repel(data=subset(plot.df, pv> 20) %>% filter(logfc <= -2 | logfc >=2 ) ,
                           aes(logfc,pv,label=name)) +
        
          theme_classic() +
          theme(legend.position = "None")+
          scale_color_manual(values = gene.cols) +
          xlab( "log2 fold change")+
          ylab("-log10(pval + 1e-300)") +
          ggtitle("Volcano plot of genes for cluster 3")

# Visualize TENT5C: tSNE
plot.df <- data.frame(emb1 = tsne$Y[,1], 
                      emb2 = tsne$Y[,2],
                      TENT5C = gexpnorm$TENT5C)

p4 <- ggplot(plot.df, aes(emb1, emb2,col=TENT5C)) + 
  geom_point(size = 0.4) +
  theme_classic() + 
  scale_color_gradient(low='lightgrey', high='red') +
  ggtitle("TENT5C expression visualized on tSNE space") 

# Visualize TENT5C: spatial coords
plot.df <- data.frame(aligned_x = data$aligned_x, 
                      aligned_y = data$aligned_y,
                      TENT5C = gexpnorm$TENT5C)

p5 <- ggplot(plot.df, aes(aligned_x, aligned_y, col=TENT5C)) + 
  geom_point(size=0.4) +
  theme_classic() + 
  scale_color_gradient(low='lightgrey', high='red') +
  ggtitle("TENT5C expression visualized on physical space")

# Ref:https://patchwork.data-imaginist.com/articles/guides/layout.html 
# Ref: https://ggrepel.slowkow.com/articles/examples
# define layout using patchwork
plot.layout <- "
AB
CD
EE
"
hw4.plt <- p1 + p2 +  p4 + p5 + p3+
  plot_layout(design = plot.layout, heights = c(1, 1, 2))+
  plot_annotation(tag_levels = 'A')

ggsave( "./hw4/hw4_abalan2.png",hw4.plt, width = 14, height = 12)