Setting up mutagenesis of whole window

Author

Saideep Gona

Published

March 9, 2023

Code

suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/c/Users/Saideep/Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG="setting-up-mutagenesis-of-whole-window" ## copy the slug from the header
bDATE='2023-03-08' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))

[1] 1

Code

WORK=DATA

Context

Yesterday, the issue with the merged pipeline was identified. The work for doing that can be found here:

Debugging merged pipeline

Since it means that the prior expression results (All GEUVADIS, ~20k protein coding genes) were likely incorrect, I am rerunning the set on Theta. This should take a few days.

Improved mutagenesis experiments

In the meantime, it is a good idea to prepare the next batch of runs to submit so that we are actively making use of our Theta allocations. To this end, I think it will be helpful for all my downstream analyses to upgrade my existing mutagenesis pipeline in the following ways:

1.) Run both the forward and reverse strand and average their predictions to expand the “sample size” for any given input.

2.) The pipeline should operate using VCF file as input. In particular, it should do mutagenesis on all variants in the VCF file.

3.) The mutagenesis should now also be personalized. Before, we simply mutated the reference sequence at a single site in a gene’s window, and observed the expression fold change between the two inputs. Now we want to measure the expression difference conditioned on individual variation at all sites in the input window for a given individual. This can help us break down the personalized signal we are observing. We can call these “marginal” effect sizes, based on the fact that we are only looking at a single variant having conditioned on the other genotypes being fixed. We can ask important downstream questions from here, such as:

Are the marginal effect sizes constant across individuals?
How closely does do marginal effect sizes match when conditioned on reference background vs. personalized backgrounds?
Is the difference in expression from the reference sequence for a given personalized input sequence easily computed as the sum of effect sizes of alternate variants?

These are all very useful metrics for breaking down the observed signal. It would be really nice, for example, to know what is driving the population variation in ERAP2 expression according to Enformer. In addition, we are currently testing for haplotype epistasis. This is something of a “loaded” analysis. Epistasis signals could be caused by the interaction between two different variants on opposing haplotypes, but the effect of these variants is diluted in the overall haplotypic expression. This could give us a chance to test for more specific haplotypic mechanisms.

Formalizing

We can formalize the above logic a bit. Let’s let $G_{i,g}^{p,1}$ be the predicted(Enformer) expression of gene $g$ for individual $i$ for haplotype $1$. This value is the result of all nucleotides in the personalized sequence jointly passed through Enformer:

$G_{i,g}^{p,1} = Quantification(Enformer((S_{i,w,g} \forall w \in \{1,2,...,W\})))=QE(S_{i,g})$

$"Enformer"$ is a shorthand function representing the standard pretrained Enformer model, taking DNA sequence of length $W=393216$ as input. $Quantification$ is shorthand for whichever quantification process converts the enformer track outputs to finals signal. For genes, the standard is summed haplotype quantification. $QE$ is shorthand for both operations together. $S_{i,w,g}$ is the nucleotide (A,C,G,T) at position $w$ for individual $i$ and within the window designated by gene $g$. $S_{i,g}$ is the whole personalized input sequence for individual $i$ and gene $g$.

For convenience, we can also denote the reference sequence $R_{g}$ as the reference sequence for gene $g$ input window. The expression for the reference sequence at gene $g$ is therefore:

$G_{R,g}^{p} = QE(R_{g})$

The common way to calculate the effect size of a mutation using Enformer is to use the reference sequence as a baseline and then calculate the fold change after mutating a single nucleotide. Formally:

$\delta_{m,g}^{A} = \frac{QE((R_{w,g} \forall w \in \{1,2,...,m-1\},S_{A,m,g},R_{w,g} \forall w \in \{m+1,m+2,...,W\}))}{QE(R_{g})}$

The numerator is just a way of saying that we are running Enformer using the reference sequence, except that the nucleotide at position $m$ is mutated to alternative allele $A$.

$\delta_{m,g}^{A}$ is generally reported as the “effect size” of the mutation according to Enformer. However, it is not guarenteed that this effect will hold consistent in all genomic contexts. For example, we could try and measure the effect size in a given individual:

$\delta_{i,m,g}^{A} = \frac{QE((S_{i,w,g} \forall w \in \{1,2,...,m-1\},S_{A,m,g},S_{i,w,g} \forall w \in \{m+1,m+2,...,W\}))}{QE((S_{i,w,g} \forall w \in \{1,2,...,m-1\},S_{R,m,g},S_{i,w,g} \forall w \in \{m+1,m+2,...,W\}))}$

Keep in mind that Enformer is highly non-linear. It has plenty of room to learn conditional dependencies between input variants. It could be the case that $\delta_{i,m,g}^{A}$ tends to match up with $\delta_{m,g}^{A}$ across individuals in the population, but this has not been properly investigated.

Actually running the experiments

For a given gene, there may be a variable number of listed variants in the VCF file. We will limit the number of variants to only those in the VCF(having some amount of population level variation). For each variant, we need:

2 runs (REF,ALT) for each context
2 contexts per variant (haplo1,haplo2)
N number of individuals we want to

--- title: "Setting up mutagenesis of whole window" author: "Saideep Gona" date: "2023-03-09" format: html: code-fold: true code-summary: "Show the code" execute: freeze: true warning: false --- ```{r} #| label: Set up box storage directory suppressMessages(library(tidyverse)) suppressMessages(library(glue)) PRE = "/c/Users/Saideep/Box/imlab-data/data-Github/Daily-Blog-Sai" ## COPY THE DATE AND SLUG fields FROM THE HEADER SLUG="setting-up-mutagenesis-of-whole-window" ## copy the slug from the header bDATE='2023-03-09' ## copy the date from the blog's header here DATA = glue("{PRE}/{bDATE}-{SLUG}") if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}")) WORK=DATA ``` # Context Yesterday, the issue with the merged pipeline was identified. The work for doing that can be found here: [Debugging merged pipeline](test_predictions.html) Since it means that the prior expression results (All GEUVADIS, \~20k protein coding genes) were likely incorrect, I am rerunning the set on Theta. This should take a few days. ## Improved mutagenesis experiments In the meantime, it is a good idea to prepare the next batch of runs to submit so that we are actively making use of our Theta allocations. To this end, I think it will be helpful for all my downstream analyses to upgrade my existing mutagenesis pipeline in the following ways: 1.) Run both the forward and reverse strand and average their predictions to expand the "sample size" for any given input. 2.) The pipeline should operate using VCF file as input. In particular, it should do mutagenesis on all variants in the VCF file. 3.) The mutagenesis should now also be personalized. Before, we simply mutated the reference sequence at a single site in a gene's window, and observed the expression fold change between the two inputs. Now we want to measure the expression difference conditioned on individual variation at all sites in the input window for a given individual. This can help us break down the personalized signal we are observing. We can call these "marginal" effect sizes, based on the fact that we are only looking at a single variant having conditioned on the other genotypes being fixed. We can ask important downstream questions from here, such as: - How closely does do marginal effect sizes match when conditioned on reference background vs. personalized backgrounds? - Is the difference in expression from the reference sequence for a given personalized input sequence easily computed as the sum of effect sizes of alternate variants? These are all very useful metrics for breaking down the observed signal. It would be really nice, for example, to know what is driving the population variation in ERAP2 expression according to Enformer. In addition, we are currently testing for haplotype epistasis. This is something of a "loaded" analysis. Epistasis signals could be caused by the interaction between two different variants on opposing haplotypes, but the effect of these variants is diluted in the overall haplotypic expression. This could give us a chance to test for more specific haplotypic mechanisms. ### Formalizing We can formalize the above logic a bit. Let's let $G_{i,g}^{p,1}$ be the predicted(Enformer) expression of gene $g$ for individual $i$ for haplotype $1$. This value is the result of all nucleotides in the personalized sequence jointly passed through Enformer: $G_{i,g}^{p,1} = Quantification(Enformer((S_{i,w,g} \forall w \in \{1,2,...,W\})))=QE(S_{i,g})$ $"Enformer"$ is a shorthand function representing the standard pretrained Enformer model, taking DNA sequence of length $W=393216$ as input. $Quantification$ is shorthand for whichever quantification process converts the enformer track outputs to finals signal. For genes, the standard is summed haplotype quantification. $QE$ is shorthand for both operations together. $S_{i,w,g}$ is the nucleotide (A,C,G,T) at position $w$ for individual $i$ and within the window designated by gene $g$. $S_{i,g}$ is the whole personalized input sequence for individual $i$ and gene $g$. For convenience, we can also denote the reference sequence $R_{g}$ as the reference sequence for gene $g$ input window. The expression for the reference sequence at gene $g$ is therefore: $G_{R,g}^{p} = QE(R_{g})$ The common way to calculate the effect size of a mutation using Enformer is to use the reference sequence as a baseline and then calculate the fold change after mutating a single nucleotide. Formally: $\delta_{m,g}^{A} = \frac{QE((R_{w,g} \forall w \in \{1,2,...,m-1\},S_{A,m,g},R_{w,g} \forall w \in \{m+1,m+2,...,W\}))}{QE(R_{g})}$ The numerator is just a way of saying that we are running Enformer using the reference sequence, except that the nucleotide at position $m$ is mutated to alternative allele $A$. $\delta_{m,g}^{A}$ is generally reported as the "effect size" of the mutation according to Enformer. However, it is not guarenteed that this effect will hold consistent in all genomic contexts. For example, we could try and measure the effect size in a given individual: $\delta_{i,m,g}^{A} = \frac{QE((S_{i,w,g} \forall w \in \{1,2,...,m-1\},S_{A,m,g},S_{i,w,g} \forall w \in \{m+1,m+2,...,W\}))}{QE((S_{i,w,g} \forall w \in \{1,2,...,m-1\},S_{R,m,g},S_{i,w,g} \forall w \in \{m+1,m+2,...,W\}))}$ Keep in mind that Enformer is highly non-linear. It has plenty of room to learn conditional dependencies between input variants. It could be the case that $\delta_{i,m,g}^{A}$ tends to match up with $\delta_{m,g}^{A}$ across individuals in the population, but this has not been properly investigated. ### Actually running the experiments For a given gene, there may be a variable number of listed variants in the VCF file. We will limit the number of variants to only those in the VCF(having some amount of population level variation). For each variant, we need: - 2 runs (REF,ALT) for each context - 2 contexts per variant (haplo1,haplo2) - Forward + reverse strand for sample size - N number of individuals we want to run for (462 for Geuvadis) For ERAP2, there are about 10k variable sites. This means that we would need to run: $2*2*2*462*10000=~36,000,000$ 18 million is a very large set, especially for only a single individual. These experiments will therefore probably have to proceed gene by gene, with ERAP2 as a good starting point. How much of this set we actually need to run depends on the goal: 1.) How closely does do marginal effect sizes match when conditioned on reference background vs. personalized backgrounds? * For this question, we can filter and test a subset of variants, maybe representative sets by allele frequency. Then we can look at the effect size distribution across individuals and relate it to the reference effect size. 2.) Is the difference in expression from the reference sequence for a given personalized input sequence easily computed as the sum of effect sizes of alternate variants? * For this question, we will need effect sizes for all variants per individual. We can look at a subset of individuals starting out in this case. Tomorrow, I will start the implementation of this type of mutagenesis.