suppressMessages(library(tidyverse))suppressMessages(library(glue))PRE ="/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"## COPY THE DATE AND SLUG fields FROM THE HEADERSLUG="population-allele-freq-enformer"## copy the slug from the headerbDATE='2023-04-24'## copy the date from the blog's header hereDATA =glue("{PRE}/{bDATE}-{SLUG}")if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))WORK=DATA
Context
With the upcoming ALCF INCITE Hackathon, our goals are two-fold:
1.) Train Enformer on Polaris using a parallelized framework 2.) Train Enformer on population allele frequency input
Selecting non-Geuvadis EUR ancestry individuals
Ideally, the above strategy can be done on any population subset we are interested in. For example, we can choose our Geuvadis individuals of European ancestry to be our “training” population. Let’s extract those now:
We are extracting non-geuvadis EUR individuals because we are assuming that they are more likely to match demographically with the identity of the “target” tracks used to train Enformer.
Computing allele frequencies in the population of interest
Working on the cluster for this:
Sex chromosomes
One issue which comes up when enacting pop-seq training
Source Code
---title: "population-allele-freq-enformer"author: "Saideep Gona"date: "2023-04-24"format: html: code-fold: true code-summary: "Show the code"execute: freeze: true warning: false---```{r}#| label: Set up box storage directorysuppressMessages(library(tidyverse))suppressMessages(library(glue))PRE ="/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"## COPY THE DATE AND SLUG fields FROM THE HEADERSLUG="population-allele-freq-enformer"## copy the slug from the headerbDATE='2023-04-24'## copy the date from the blog's header hereDATA =glue("{PRE}/{bDATE}-{SLUG}")if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))WORK=DATA```## ContextWith the upcoming ALCF INCITE Hackathon, our goals are two-fold:1.) Train Enformer on Polaris using a parallelized framework2.) Train Enformer on population allele frequency input### Selecting non-Geuvadis EUR ancestry individualsIdeally, the above strategy can be done on any population subset we are interested in. For example, we can choose our Geuvadis individuals of European ancestry to be our "training" population. Let's extract those now:```{r}library(tidyverse)metadata_dir <-"/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/analysis-Sai/metadata"individual_metadata <-as.data.frame(read_delim(file.path(metadata_dir,"igsr_samples.tsv")))rownames(individual_metadata) <- individual_metadata$`Sample name`individual_metadata <- individual_metadata[!is.na(individual_metadata$`Population code`),]table(individual_metadata$`Superpopulation code`)dset_dir <-file.path("/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/analysis-Sai/enformer_geuvadis")geuvadis_full <-as.data.frame(read_table(file.path(dset_dir,"GD462.GeneQuantRPKM.50FN.samplename.resk10.txt")))geuvadis <-separate(geuvadis_full, 2, "Gene_Symbol_uv")rownames(geuvadis) <- geuvadis$Gene_Symbol_uvgeuvadis <- geuvadis[,5:ncol(geuvadis)]individual_metadata_geu <- individual_metadata[colnames(geuvadis),]individual_metadata_geu_EUR <- individual_metadata_geu[individual_metadata_geu$`Superpopulation code`=="EUR",]write.csv(individual_metadata_geu_EUR$`Sample name`,file=file.path("/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai/2023-04-24-population-allele-freq-enformer/EUR_geuvadis.csv"))non_geuvadis <-setdiff(rownames(individual_metadata),colnames(geuvadis))individual_metadata_nongeu <- individual_metadata[as.vector(non_geuvadis),]individual_metadata_nongeu_EUR <- individual_metadata_nongeu[individual_metadata_nongeu$`Superpopulation code`=="EUR",]write.csv(individual_metadata_nongeu_EUR$`Sample name`,file=file.path("/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai/2023-04-24-population-allele-freq-enformer/EUR_nongeuvadis.csv"))length(intersect(rownames(individual_metadata_geu_EUR), rownames(individual_metadata_nongeu_EUR)))```We are extracting non-geuvadis EUR individuals because we are assuming that they are more likely to match demographically with the identity of the "target" tracks used to train Enformer.### Computing allele frequencies in the population of interestWorking on the cluster for this:### Sex chromosomesOne issue which comes up when enacting pop-seq training