quantifying_tss_expression_across_sites

Author

Saideep Gona

Published

April 17, 2023

Code
suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG="quantifying_tss_expression_across_sites" ## copy the slug from the header
bDATE='2023-04-17' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK=DATA

Context

Last week I looked in-depth at the primary TSS quantification for ERAP2 and saw that using the primary TSS alone can be very misleading. Here I will extend across genes and look at the distribution of TSS expression in general. The goal is to justify the choice of quantification technique.

Excluding TSS sites outside of Enformer output window

One issue that causes the quantification to fail is when TSS sites collected from the annotation files fall outside the enformer output window. This mainly happens for larger genes. The simplest solution to this issue is to modify the tss specification file ahead of time rather than the quantification logic itself.

Quantificatin of a single individual at tss resolution across genes

For simplicity, I am going to quantify only a single individual and store quantifications for all TSS for all genes using the above filtered TSS sites

The following table summarizes this:

Code
files_path = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/analysis-Sai/enformer_geuvadis/CAGE_lcl_enformer_geuvadis_haplo1_all_tss.csv"



prop_first = []
num_tss = []
c = 0

with open(files_path,"r") as f:
  for line in f:
    if c == 0:
      c+=1
      continue
    p_line = line.strip().split(',')
    # print(p_line)
    if p_line[2] == '':
      continue
    vals = p_line[2].split(";")
    
    vals_num = [float(x) for x in vals]
    max_val = max(vals_num)
    sum_val = sum(vals_num)
    prop_first.append(max_val/sum_val)
    num_tss.append(len(vals_num))
    c+=1
    
with open("/Users/saideepgona/Documents/PhD/imlab/Github_Repos/Daily-Blog-Sai/posts/2023-04-17-quantifying_tss_expression_across_sites/prop_first.txt", "w") as o:
  for i in range(len(w_prop_first)):
    n_line = [str(w_prop_first[i]),str(num_tss[i])]
    o.write(",".join(n_line)+"\n")
Code
library(ggplot2)

props_first <- read.csv("/Users/saideepgona/Documents/PhD/imlab/Github_Repos/Daily-Blog-Sai/posts/2023-04-17-quantifying_tss_expression_across_sites/prop_first.txt")
colnames(props_first) <- c("prop_max","num_tss")

hist(props_first$prop_max, breaks = 40)

Code
ggplot(props_first) + geom_point(aes(x=prop_max,y=num_tss))

Code
ggplot(props_first) + geom_hex(aes(x=prop_max,y=num_tss))

Code
ggplot(props_first) + geom_hex(aes(x=prop_max,y=num_tss,fill = stat(log(count))))