Code
library(glue)
library(dplyr)
library(tidyverse)
Saideep Gona
November 21, 2023
Having successfully linearized the fine-tuned Aracena predictor, I should also establish a baseline by performing the PrediXcan analysis on the Aracena data. While this was technically already done in the original study Katie doesn’t have the intermediate files from the analysis available. Furthermore, I think I should run the analysis so I have more intimate familiarity with the practical considerations of running PrediXcan as downstream users experience.
It will be very helpful to get an idea of which/how many SNPs are involved in all these different operations being conducted.
#TODO - Predict personalized aracena expression #TODO - Write out workflow + goals more formally
Precomputed expression exists, but I need to to figure out a few things with the provided table. First, I need the gene names to be matched up with those in the gene annotations I have been using. Second, I need to know what normalization/processing was done.
[1] 13716 35
write.table(aracena_gene_ids[(aracena_gene_ids %in% annotation_gene_ids)], file="/beagle3/haky/users/saideep/github_repos/Daily-Blog-Sai/posts/2023-11-22-retrain_expression_predictor/common_genes.txt", row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
sum(aracena_gene_ids %in% annotation_gene_ids)
[1] 10818
[1] 10818
From analyzing the gene ids not present in my current annotations, these seem to be mainly non-coding genes such as for lnc-RNAs. From the manuscript, this is due to filtering being performed for RPKM > 2 in either Flu or NI. That is, this geneset includes transcripts with some measurable expression. I think it is ok to use the intersection of the two gene sets for now.
---
title: "0.1_aracena_predixcan"
author: "Saideep Gona"
date: "2023-11-21"
format:
html:
code-fold: true
code-summary: "Show the code"
execute:
freeze: true
warning: false
---
# Context
Having successfully linearized the fine-tuned Aracena predictor, I should also establish a baseline by performing the PrediXcan analysis on the Aracena data. While this was technically already done in the original study Katie doesn't have the intermediate files from the analysis available. Furthermore, I think I should run the analysis so I have more intimate familiarity with the practical considerations of running PrediXcan as downstream users experience.
### How many SNPs are being analyzed?
It will be very helpful to get an idea of which/how many SNPs are involved in all these different operations being conducted.
* Typical GWAWS:
- 200,000 - 2,000,000 SNPs in genotyping array
* 1000 Genomes:
- 84.7mil total variants across populations
- 64mil < 0.5% af
- 12mil 0.5-5% af
- 8mil > 5% af
- 3-4mil unique non-ref SNPs per genome
* Linearization:
- SNPs in GEUVADIS genotypes: 4,410,942
* SNPs in Aracena genotypes:
- Unfiltered: 7,383,276
- Filtered: 20,516,196
#TODO - Predict personalized aracena expression
#TODO - Write out workflow + goals more formally
### Preparing expression matrices
Precomputed expression exists, but I need to to figure out a few things with the provided table. First, I need the gene names to be matched up with those in the gene annotations I have been using. Second, I need to know what normalization/processing was done.
```{r}
library(glue)
library(dplyr)
library(tidyverse)
```
```{r}
path_to_aracena_expression <- "/beagle3/haky/users/saideep/projects/aracena_modeling/Inputs/counts_matrices/fully_preprocessed_expression.txt"
aracena_expression_table <- read.table(path_to_aracena_expression, header = TRUE)
```
```{r}
aracena_flu_expression_table <- aracena_expression_table %>%
select(contains("Flu"))
dim(aracena_flu_expression_table)
aracena_gene_ids <- rownames(aracena_flu_expression_table)
```
```{r}
gene_annotation_file <- "/beagle3/haky/users/saideep/github_repos/Daily-Blog-Sai/posts/2023-11-21-aracena_predixcan/canonical_TSS_full_metadata.txt"
gene_annotation_table <- read.delim(gene_annotation_file, header = TRUE, sep=",")
annotation_gene_ids <- gene_annotation_table$ensembl_gene_id
```
```{r}
write.table(aracena_gene_ids[(aracena_gene_ids %in% annotation_gene_ids)], file="/beagle3/haky/users/saideep/github_repos/Daily-Blog-Sai/posts/2023-11-22-retrain_expression_predictor/common_genes.txt", row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
sum(aracena_gene_ids %in% annotation_gene_ids)
sum(annotation_gene_ids %in% aracena_gene_ids)
```
From analyzing the gene ids not present in my current annotations, these seem to be mainly non-coding genes such as for lnc-RNAs. From the manuscript, this is due to filtering being performed for RPKM > 2 in either Flu or NI. That is, this geneset includes transcripts with some measurable expression. I think it is ok to use the intersection of the two gene sets for now.
```{r}
```
```{r}
```