population-allele-freq-enformer

Author

Saideep Gona

Published

April 24, 2023

Code

suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG="population-allele-freq-enformer" ## copy the slug from the header
bDATE='2023-04-24' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK=DATA

Context

With the upcoming ALCF INCITE Hackathon, our goals are two-fold:

1.) Train Enformer on Polaris using a parallelized framework 2.) Train Enformer on population allele frequency input

Selecting non-Geuvadis EUR ancestry individuals

Ideally, the above strategy can be done on any population subset we are interested in. For example, we can choose our Geuvadis individuals of European ancestry to be our “training” population. Let’s extract those now:

Code

library(tidyverse)

metadata_dir <- "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/analysis-Sai/metadata"

individual_metadata <- as.data.frame(read_delim(file.path(metadata_dir,"igsr_samples.tsv")))
rownames(individual_metadata) <- individual_metadata$`Sample name`
individual_metadata <- individual_metadata[!is.na(individual_metadata$`Population code`),]

table(individual_metadata$`Superpopulation code`)


    AFR     AMR     EAS     EUR EUR,AFR     SAS 
   1421     535     621     669       1     661

Code

dset_dir <- file.path("/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/analysis-Sai/enformer_geuvadis")

geuvadis_full <- as.data.frame(read_table(file.path(dset_dir,"GD462.GeneQuantRPKM.50FN.samplename.resk10.txt")))
geuvadis <- separate(geuvadis_full, 2, "Gene_Symbol_uv")
rownames(geuvadis) <- geuvadis$Gene_Symbol_uv
geuvadis <- geuvadis[,5:ncol(geuvadis)]

individual_metadata_geu <- individual_metadata[colnames(geuvadis),]

individual_metadata_geu_EUR <- individual_metadata_geu[individual_metadata_geu$`Superpopulation code` == "EUR",]

write.csv(individual_metadata_geu_EUR$`Sample name`,file=file.path("/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai/2023-04-24-population-allele-freq-enformer/EUR_geuvadis.csv"))

non_geuvadis <- setdiff(rownames(individual_metadata),colnames(geuvadis))

individual_metadata_nongeu <- individual_metadata[as.vector(non_geuvadis),]

individual_metadata_nongeu_EUR <- individual_metadata_nongeu[individual_metadata_nongeu$`Superpopulation code` == "EUR",]

write.csv(individual_metadata_nongeu_EUR$`Sample name`,file=file.path("/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai/2023-04-24-population-allele-freq-enformer/EUR_nongeuvadis.csv"))

length(intersect(rownames(individual_metadata_geu_EUR), rownames(individual_metadata_nongeu_EUR)))

[1] 0

We are extracting non-geuvadis EUR individuals because we are assuming that they are more likely to match demographically with the identity of the “target” tracks used to train Enformer.

Computing allele frequencies in the population of interest

Working on the cluster for this:

Sex chromosomes

One issue which comes up when enacting pop-seq training

--- title: "population-allele-freq-enformer" author: "Saideep Gona" date: "2023-04-24" format: html: code-fold: true code-summary: "Show the code" execute: freeze: true warning: false --- ```{r} #| label: Set up box storage directory suppressMessages(library(tidyverse)) suppressMessages(library(glue)) PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai" ## COPY THE DATE AND SLUG fields FROM THE HEADER SLUG="population-allele-freq-enformer" ## copy the slug from the header bDATE='2023-04-24' ## copy the date from the blog's header here DATA = glue("{PRE}/{bDATE}-{SLUG}") if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}")) WORK=DATA ``` ## Context With the upcoming ALCF INCITE Hackathon, our goals are two-fold: 1.) Train Enformer on Polaris using a parallelized framework 2.) Train Enformer on population allele frequency input ### Selecting non-Geuvadis EUR ancestry individuals Ideally, the above strategy can be done on any population subset we are interested in. For example, we can choose our Geuvadis individuals of European ancestry to be our "training" population. Let's extract those now: ```{r} library(tidyverse) metadata_dir <- "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/analysis-Sai/metadata" individual_metadata <- as.data.frame(read_delim(file.path(metadata_dir,"igsr_samples.tsv"))) rownames(individual_metadata) <- individual_metadata$`Sample name` individual_metadata <- individual_metadata[!is.na(individual_metadata$`Population code`),] table(individual_metadata$`Superpopulation code`) dset_dir <- file.path("/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/analysis-Sai/enformer_geuvadis") geuvadis_full <- as.data.frame(read_table(file.path(dset_dir,"GD462.GeneQuantRPKM.50FN.samplename.resk10.txt"))) geuvadis <- separate(geuvadis_full, 2, "Gene_Symbol_uv") rownames(geuvadis) <- geuvadis$Gene_Symbol_uv geuvadis <- geuvadis[,5:ncol(geuvadis)] individual_metadata_geu <- individual_metadata[colnames(geuvadis),] individual_metadata_geu_EUR <- individual_metadata_geu[individual_metadata_geu$`Superpopulation code` == "EUR",] write.csv(individual_metadata_geu_EUR$`Sample name`,file=file.path("/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai/2023-04-24-population-allele-freq-enformer/EUR_geuvadis.csv")) non_geuvadis <- setdiff(rownames(individual_metadata),colnames(geuvadis)) individual_metadata_nongeu <- individual_metadata[as.vector(non_geuvadis),] individual_metadata_nongeu_EUR <- individual_metadata_nongeu[individual_metadata_nongeu$`Superpopulation code` == "EUR",] write.csv(individual_metadata_nongeu_EUR$`Sample name`,file=file.path("/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai/2023-04-24-population-allele-freq-enformer/EUR_nongeuvadis.csv")) length(intersect(rownames(individual_metadata_geu_EUR), rownames(individual_metadata_nongeu_EUR))) ``` We are extracting non-geuvadis EUR individuals because we are assuming that they are more likely to match demographically with the identity of the "target" tracks used to train Enformer. ### Computing allele frequencies in the population of interest Working on the cluster for this: ### Sex chromosomes One issue which comes up when enacting pop-seq training