population-allele-freq-enformer

Author

Saideep Gona

Published

April 24, 2023

Code
suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG="population-allele-freq-enformer" ## copy the slug from the header
bDATE='2023-04-24' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK=DATA

Context

With the upcoming ALCF INCITE Hackathon, our goals are two-fold:

1.) Train Enformer on Polaris using a parallelized framework 2.) Train Enformer on population allele frequency input

Selecting non-Geuvadis EUR ancestry individuals

Ideally, the above strategy can be done on any population subset we are interested in. For example, we can choose our Geuvadis individuals of European ancestry to be our “training” population. Let’s extract those now:

Code
library(tidyverse)

metadata_dir <- "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/analysis-Sai/metadata"

individual_metadata <- as.data.frame(read_delim(file.path(metadata_dir,"igsr_samples.tsv")))
rownames(individual_metadata) <- individual_metadata$`Sample name`
individual_metadata <- individual_metadata[!is.na(individual_metadata$`Population code`),]

table(individual_metadata$`Superpopulation code`)

    AFR     AMR     EAS     EUR EUR,AFR     SAS 
   1421     535     621     669       1     661 
Code
dset_dir <- file.path("/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/analysis-Sai/enformer_geuvadis")

geuvadis_full <- as.data.frame(read_table(file.path(dset_dir,"GD462.GeneQuantRPKM.50FN.samplename.resk10.txt")))
geuvadis <- separate(geuvadis_full, 2, "Gene_Symbol_uv")
rownames(geuvadis) <- geuvadis$Gene_Symbol_uv
geuvadis <- geuvadis[,5:ncol(geuvadis)]

individual_metadata_geu <- individual_metadata[colnames(geuvadis),]

individual_metadata_geu_EUR <- individual_metadata_geu[individual_metadata_geu$`Superpopulation code` == "EUR",]

write.csv(individual_metadata_geu_EUR$`Sample name`,file=file.path("/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai/2023-04-24-population-allele-freq-enformer/EUR_geuvadis.csv"))

non_geuvadis <- setdiff(rownames(individual_metadata),colnames(geuvadis))

individual_metadata_nongeu <- individual_metadata[as.vector(non_geuvadis),]

individual_metadata_nongeu_EUR <- individual_metadata_nongeu[individual_metadata_nongeu$`Superpopulation code` == "EUR",]

write.csv(individual_metadata_nongeu_EUR$`Sample name`,file=file.path("/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai/2023-04-24-population-allele-freq-enformer/EUR_nongeuvadis.csv"))

length(intersect(rownames(individual_metadata_geu_EUR), rownames(individual_metadata_nongeu_EUR)))
[1] 0

We are extracting non-geuvadis EUR individuals because we are assuming that they are more likely to match demographically with the identity of the “target” tracks used to train Enformer.

Computing allele frequencies in the population of interest

Working on the cluster for this:

Sex chromosomes

One issue which comes up when enacting pop-seq training