debugging-merged-pipeline

Author

Saideep Gona

Published

March 8, 2023

Code
suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/c/Users/Saideep/Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG="debugging-merged-pipeline" ## copy the slug from the header
bDATE='2023-03-08' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
[1] 0
Code
WORK=DATA

Context

Yesterday, I observed that the Enformer CAGE outputs for ERAP2 from the merged pipeline differ from those of my prior pipeline. So far, the quantification step itself does not appear to be the source of the discrepancy. This post continues the debugging process.

Benchmarking against the Google Colab notebook

The Enformer Google Colab notebook is a useful reference for debugging because it is unlikely to contain bugs for basic Enformer runs. I therefore ran the ERAP2 reference sequence through the Colab notebook as a comparison track. The resulting ERAP2 LCL output matches my prior pipeline much more closely than it matches the merged pipeline: the summed prediction for a single haplotype from my run and the summed prediction for the reference sequence in Colab are both ~130, as opposed to ~50 for the merged pipeline.

Debugging merged pipeline outputs

In response to the above, I spent today on a long debugging session with Temi. The process is recorded in /projects/covid-ct/imlab/users/saideep/enformer_all_geuvadis/other/test_predictions.ipynb. Because the session was long, full documentation will be completed tomorrow. In summary, there appears to have been a misalignment when reading the reference genome, which shifted the sequence mutations by 1 bp and produced a heavily mutated input sequence. We confirmed that the reference and VCF files do indeed agree on the reference allele, and the quantification scheme does appear to be accurate. Going forward, I highly recommend adding a check that the reference sequence matches the REF allele of each VCF variant before any mutation is applied. If this check passes for all variants, it is extremely unlikely that there is an issue with the creation of the personalized sequence.
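
The recommended check can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code from the merged pipeline: the function names (check_ref_alleles, apply_snps), the (pos, ref, alt) tuple layout, and the toy sequence are all assumptions made for the example. It relies only on the fact that VCF positions are 1-based while Python string indices are 0-based, which is one common way a 1 bp shift can creep in.

```python
def check_ref_alleles(ref_seq, region_start, variants):
    """Return the variants whose REF allele disagrees with the reference.

    ref_seq      : reference bases for the region (hypothetical input)
    region_start : 1-based genomic coordinate of ref_seq[0]
    variants     : iterable of (pos, ref, alt), pos 1-based as in a VCF
    """
    mismatches = []
    for pos, ref, alt in variants:
        offset = pos - region_start  # 0-based index into ref_seq
        if offset < 0:
            observed = ""  # variant lies before the extracted region
        else:
            observed = ref_seq[offset:offset + len(ref)]
        if observed.upper() != ref.upper():
            mismatches.append((pos, ref, alt, observed))
    return mismatches


def apply_snps(ref_seq, region_start, variants):
    """Apply SNPs to ref_seq, refusing to run if any REF allele mismatches."""
    mismatches = check_ref_alleles(ref_seq, region_start, variants)
    if mismatches:
        raise ValueError(
            f"{len(mismatches)} variant(s) do not match the reference; "
            f"first mismatch: {mismatches[0]}"
        )
    seq = list(ref_seq)
    for pos, ref, alt in variants:
        if len(ref) == 1 and len(alt) == 1:  # this sketch handles SNPs only
            seq[pos - region_start] = alt
    return "".join(seq)


# Toy example: a 10 bp region whose first base sits at position 101.
toy_ref = "ACGTACGTAC"
toy_variants = [(103, "G", "T"), (108, "T", "C")]

# Correct coordinates: every REF allele matches, so the SNPs are applied.
personalized = apply_snps(toy_ref, 101, toy_variants)

# Region start off by 1 bp: both variants are flagged before any mutation.
shifted_mismatches = check_ref_alleles(toy_ref, 100, toy_variants)
```

With the correct start coordinate the check passes silently, while the shifted coordinate makes both variants fail, so the error surfaces before any prediction is run. A real VCF would also need indels and multi-allelic sites handled, which this sketch deliberately leaves out.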

Again, more detail to come tomorrow.