suppressMessages(library(tidyverse))suppressMessages(library(glue))PRE ="C:\\Users\\Saideep\\Box\\imlab-data\\data-Github\\Daily-Blog-Sai"## COPY THE DATE AND SLUG fields FROM THE HEADERSLUG="updating-existing-geuvadis-analysis-p2"## copy the slug from the headerbDATE='2023-03-17'## copy the date from the blog's header hereDATA =glue("{PRE}/{bDATE}-{SLUG}")if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))WORK=DATA
Context
Today I am continuing to update my analyses with new data.
Constructing gene metadata table
I task I have put off for a while is collecting gene and individual metadata tables. These are useful context when plotting, modeling, etc. I will store these in a general location for cross-analysis accessibility: “C:\Users\Saideep\Box\imlab-data\data-Github\analysis-Sai\metadata”
Let’s start the table off with a set of protein-coding genes from a GRCh38 annotation file
Festus shared with me a heritability database. I will explore it for the purpose of having gene-level heritability + pve metadata. The following is the code he shared with me:
Code
library(RSQLite)gen_db <-"C:\\Users\\Saideep\\Box\\imlab-data\\data-Github\\analysis-Sai\\metadata\\genarch.db"sqlite <-dbDriver("SQLite")db =dbConnect(sqlite,gen_db)tissues_in_table =dbGetQuery(db,"select DISTINCT tissue from results")genes_in_table =dbGetQuery(db,"select DISTINCT ensid from results")print(tissues_in_table)
The table has 20 tissue types in it. I am most interested in LCL cell line at the moment. There is no LCL in this dataset, so I will just use “WholeBlood_TS” as a proxy.
Code
db =dbConnect(sqlite,gen_db)blood_gene_metadata =dbGetQuery(db,"select * from results where tissue = 'WholeBlood_TS'")dbDisconnect(db)gene_ids_h2 <-sapply(str_split(blood_gene_metadata$ensid, "\\."),`[`,1)rownames(blood_gene_metadata) <- gene_ids_h2
I will now add some of this data to my existing metadata table:
---title: "Updating Existing GEUVADIS Analyses part 2"author: "Saideep Gona"date: "2023-03-17"format: html: code-fold: true code-summary: "Show the code"execute: freeze: true warning: false---```{r}#| label: Set up box storage directorysuppressMessages(library(tidyverse))suppressMessages(library(glue))PRE ="C:\\Users\\Saideep\\Box\\imlab-data\\data-Github\\Daily-Blog-Sai"## COPY THE DATE AND SLUG fields FROM THE HEADERSLUG="updating-existing-geuvadis-analysis-p2"## copy the slug from the headerbDATE='2023-03-17'## copy the date from the blog's header hereDATA =glue("{PRE}/{bDATE}-{SLUG}")if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))WORK=DATA```# ContextToday I am continuing to update my analyses with new data.## Constructing gene metadata tableI task I have put off for a while is collecting gene and individual metadata tables. These are useful context when plotting, modeling, etc. I will store these in a general location for cross-analysis accessibility: "C:\\Users\\Saideep\\Box\\imlab-data\\data-Github\\analysis-Sai\\metadata"Let's start the table off with a set of protein-coding genes from a GRCh38 annotation file```{r}gene_metadata =read.delim("C:\\Users\\Saideep\\Box\\imlab-data\\data-Github\\analysis-Sai\\metadata\\protein_coding_nosex.bed", header =FALSE)colnames(gene_metadata) <-c("chrom","start","end","gene_id","gene_name")gene_metadata$length <- gene_metadata$end - gene_metadata$startgene_ids <-sapply(str_split(gene_metadata$gene_id, "\\."),`[`,1)rownames(gene_metadata) <- gene_idsgene_metadata$gene_id <-NULL```### HeritabilityFestus shared with me a heritability database. I will explore it for the purpose of having gene-level heritability + pve metadata. The following is the code he shared with me:```{r}library(RSQLite)gen_db <-"C:\\Users\\Saideep\\Box\\imlab-data\\data-Github\\analysis-Sai\\metadata\\genarch.db"sqlite <-dbDriver("SQLite")db =dbConnect(sqlite,gen_db)tissues_in_table =dbGetQuery(db,"select DISTINCT tissue from results")genes_in_table =dbGetQuery(db,"select DISTINCT ensid from results")print(tissues_in_table)print(nrow(genes_in_table))dbDisconnect(db)```The table has 20 tissue types in it. I am most interested in LCL cell line at the moment. There is no LCL in this dataset, so I will just use "WholeBlood_TS" as a proxy.```{r}db =dbConnect(sqlite,gen_db)blood_gene_metadata =dbGetQuery(db,"select * from results where tissue = 'WholeBlood_TW'")dbDisconnect(db)gene_ids_h2 <-sapply(str_split(blood_gene_metadata$ensid, "\\."),`[`,1)rownames(blood_gene_metadata) <- gene_ids_h2```I will now add some of this data to my existing metadata table:```{r}sub_blood_gene_metadata <- blood_gene_metadata[as.vector(gene_ids),]gene_metadata$h2_WholeBlood_TS <- sub_blood_gene_metadata$h2```### Transcription properties```{r}gene_tss_metadata <-read.delim("C:\\Users\\Saideep\\Box\\imlab-data\\data-Github\\analysis-Sai\\metadata\\genes_tss.bed", header =FALSE)gene_ids_tss <-sapply(str_split(gene_tss_metadata$V1, "\\."),`[`,1)gene_tss_metadata$id <- gene_ids_tssgene_tss_metadata_nodup <- gene_tss_metadata %>%distinct(id, .keep_all =TRUE)rownames(gene_tss_metadata_nodup) <- gene_tss_metadata_nodup$idsub_gene_tss_metadata <- gene_tss_metadata_nodup[gene_ids,]gene_metadata$strand <- sub_gene_tss_metadata$V3gene_metadata$tss_sites <- sub_gene_tss_metadata$V2rownames(gene_metadata) <- gene_ids```Now I can save the table and use it in my other analyses```{r}write.csv2(gene_metadata,file ="C:\\Users\\Saideep\\Box\\imlab-data\\data-Github\\analysis-Sai\\metadata\\gene_metadata_full.csv",row.names =TRUE)```## Portability UpdateThe following is an updated figure from the portability analysis demonstrating cross-individual correlations for \~3k genes shared between Enformer, Predixcan, and Geuvadis.![](fig-all_popcorrelations-1.png)The results are more consistent with the smaller gene subsets we have run in the past.The population-specific analyses seem a little suspicious (but are uploaded). Going to revisit them after updating the epistasis results.