0.1_aracena_predixcan

Author

Saideep Gona

Published

November 21, 2023

Context

Having successfully linearized the fine-tuned Aracena predictor, I should also establish a baseline by performing the PrediXcan analysis on the Aracena data. While this was technically already done in the original study Katie doesn’t have the intermediate files from the analysis available. Furthermore, I think I should run the analysis so I have more intimate familiarity with the practical considerations of running PrediXcan as downstream users experience.

How many SNPs are being analyzed?

It will be very helpful to get an idea of which/how many SNPs are involved in all these different operations being conducted.

  • Typical GWAWS:
    • 200,000 - 2,000,000 SNPs in genotyping array
  • 1000 Genomes:
    • 84.7mil total variants across populations
    • 64mil < 0.5% af
    • 12mil 0.5-5% af
    • 8mil > 5% af
    • 3-4mil unique non-ref SNPs per genome
  • Linearization:
    • SNPs in GEUVADIS genotypes: 4,410,942
  • SNPs in Aracena genotypes:
    • Unfiltered: 7,383,276
    • Filtered: 20,516,196

#TODO - Predict personalized aracena expression #TODO - Write out workflow + goals more formally

Preparing expression matrices

Precomputed expression exists, but I need to to figure out a few things with the provided table. First, I need the gene names to be matched up with those in the gene annotations I have been using. Second, I need to know what normalization/processing was done.

Code
library(glue)
library(dplyr)
library(tidyverse)
Code
path_to_aracena_expression <- "/beagle3/haky/users/saideep/projects/aracena_modeling/Inputs/counts_matrices/fully_preprocessed_expression.txt"

aracena_expression_table <- read.table(path_to_aracena_expression, header = TRUE)
Code
aracena_flu_expression_table <- aracena_expression_table %>% 
  select(contains("Flu"))
dim(aracena_flu_expression_table)
[1] 13716    35
Code
aracena_gene_ids <- rownames(aracena_flu_expression_table)
Code
gene_annotation_file <- "/beagle3/haky/users/saideep/github_repos/Daily-Blog-Sai/posts/2023-11-21-aracena_predixcan/canonical_TSS_full_metadata.txt"

gene_annotation_table <- read.delim(gene_annotation_file, header = TRUE, sep=",")

annotation_gene_ids <- gene_annotation_table$ensembl_gene_id
Code
write.table(aracena_gene_ids[(aracena_gene_ids %in% annotation_gene_ids)], file="/beagle3/haky/users/saideep/github_repos/Daily-Blog-Sai/posts/2023-11-22-retrain_expression_predictor/common_genes.txt", row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")

sum(aracena_gene_ids %in% annotation_gene_ids)
[1] 10818
Code
sum(annotation_gene_ids %in% aracena_gene_ids)
[1] 10818

From analyzing the gene ids not present in my current annotations, these seem to be mainly non-coding genes such as for lnc-RNAs. From the manuscript, this is due to filtering being performed for RPKM > 2 in either Flu or NI. That is, this geneset includes transcripts with some measurable expression. I think it is ok to use the intersection of the two gene sets for now.