HW1 Submission: Visualizing Locations of Different Gene Expressions


Dee Velazquez
I am an undergraduate student majoring in computer science and chemical & biomolecular engineering at Johns Hopkins. I have had previous lab and research experience in cellular biology, computational chemistry, and deep learning. I am currently interested in using my technical background to analyze data and solve biomedical problems. Outside the lab, I enjoy working out, listening to music, cooking, and watching movies/anime.

[description] For my data visualization, I selected 15 genes and displayed, for each spot, the proportion of that spot's counts contributed by each gene, drawn at the spot's location. The data types I am visualizing are spatial data (the x, y position of each spot), categorical data (the identity of each gene), and quantitative data (the count of each gene in a spot, converted to a proportion). For geometric primitives, I am using points to represent each spot (although at this scale each one looks like a line) and areas to show which genes are expressed at that spot and in what proportion. For visual channels, I am using hue to encode the different genes expressed at each spot and size to encode the proportion of each gene at that spot; each spot takes the shape of a rectangle, which again looks like a line at this scale. I am trying to make salient which genes are expressed at each spatial location and where certain genes are more highly expressed than others. I used my scatterbar package, which I developed during an internship. With the vast number of spots present, however, it is tough to get a clear picture, which suggests that further improvements could be made. The Gestalt principles I used were similarity, with each gene associated with one hue throughout the scatterbar, and continuity (at least I tried), with each spot represented by a stacked bar whose colored segments add up to 1 and show what is expressed at that spot.
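The stacked-bar geometry described above can be sketched on toy data before it is applied to the full dataset. This is a minimal base R illustration, not the scatterbar package itself; the data frame `toy` and its spot names, gene names, and counts are made up for the example.

```r
# Toy data (hypothetical): counts of three genes at two spots
toy <- data.frame(
  spot  = c("s1", "s1", "s1", "s2", "s2", "s2"),
  genes = c("A", "B", "C", "A", "B", "C"),
  count = c(2, 6, 2, 0, 5, 5)
)

# Proportion of each gene within its spot (sums to 1 per spot)
toy$proportion <- toy$count / ave(toy$count, toy$spot, FUN = sum)

# Lower edge of each segment: running total of the proportions before it
toy$lower <- ave(toy$proportion, toy$spot, FUN = cumsum) - toy$proportion

# Center of each segment, which is what a tile-based geom needs as its y value
toy$centre <- toy$lower + toy$proportion / 2

toy
```

Each segment's height is its proportion and its center sits at `lower + proportion / 2`, so the segments of one spot stack from 0 to 1 without overlapping.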

# Dee Velazquez
# HW 1

# Get eevee dataset
data <- read.csv('eevee.csv.gz', row.names = 1)
dim(data)
ncol(data)
data[1:10, 1:10]
colnames(data)

library(ggplot2)
library(dplyr)

#x <- data$aligned_x
#y <- data$aligned_y

# Create a pos df with the x and y coordinates (columns 2 and 3)
pos <- data[, 2:3]
# Rename columns to x and y
names(pos) <- c("x", "y")
# Add a spot column so pos can be merged with the gene data later
pos$spot <- rownames(pos)
# Make data a df
data <- as.data.frame(data)
# Create a column for spots
data$spot <- rownames(data)
# Create df for genes (every column from the 4th onward, including spot)
genes <- data[, 4:ncol(data)]
# Create a long df to easily view gene counts in a spot
data_long <- tidyr::pivot_longer(genes, cols = -spot, names_to = "genes", values_to = "count")
# Focus on the first 15 genes and repeat the process
new_genes <- genes[1:15]
new_genes$spot <- rownames(new_genes)
data_long2 <- tidyr::pivot_longer(new_genes, cols = -spot, names_to = "genes", values_to = "count")
# Turn gene counts into proportions, so we can see the proportion of genes
# expressed in each spot
# Note: a spot with zero total count across these 15 genes gives NaN here
data_long3 <-
  data_long2 %>%
  group_by(spot) %>%
  mutate(proportion = count / sum(count)) %>%
  ungroup() %>%
  select(-count) %>%
  distinct()

# Combine pos and gene proportion data
# We can then use this to create a scatterbar (stacked
# bar graph at each (x,y))
combined_data <- merge(data_long3, pos, by = "spot")
combined_data <- combined_data %>%
  group_by(x, y) %>%
  # Lower edge of each gene's segment: the running total of the proportions
  # before it, so the segments within a spot stack without overlapping
  mutate(cumulative_proportion = cumsum(proportion) - proportion) %>%
  ungroup()

# Position the bars within each (x, y) spot: geom_tile() places tiles by their
# center, so each segment sits at its lower edge plus half its height
p <- ggplot(combined_data, aes(x = x, y = y + cumulative_proportion + proportion/2)) +
  geom_tile(aes(fill = genes, height = proportion), width = 50, lwd = 0) +
  theme_bw() +
  labs(title = "Eevee Visual", x = "X", y = "Y")
p
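
One possible improvement, given that each stacked bar above is at most 1 unit tall but 50 units wide, is to scale the bar height by the same factor as the width so that the segments become visible as areas. The sketch below is an assumption about how that could be done, not the scatterbar package's method; `toy`, `bar_size`, and `p_scaled` are hypothetical names, and the value 50 simply mirrors the width used above and would need tuning to the actual spacing between spots.

```r
library(ggplot2)

# Hypothetical stand-in for combined_data: two spots, three genes each
toy <- data.frame(
  x = rep(c(100, 200), each = 3),
  y = rep(c(100, 100), each = 3),
  genes = rep(c("A", "B", "C"), times = 2),
  proportion = c(0.2, 0.6, 0.2, 0.1, 0.4, 0.5)
)

# Lower edge of each segment within its (x, y) spot
toy$lower <- ave(toy$proportion, toy$x, toy$y, FUN = cumsum) - toy$proportion

# Assumed bar size in coordinate units; tune to the spacing between spots
bar_size <- 50

# Scale both the segment centers and the segment heights by bar_size
p_scaled <- ggplot(toy, aes(x = x, y = y + bar_size * (lower + proportion / 2))) +
  geom_tile(aes(fill = genes, height = bar_size * proportion), width = bar_size) +
  coord_fixed() +
  theme_bw() +
  labs(title = "Scaled scatterbar sketch", x = "X", y = "Y")
p_scaled
```

With `coord_fixed()`, one unit on the x axis equals one unit on the y axis, so each spot is drawn as a square stacked bar rather than a thin line.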