suppressMessages(library(tidyverse))
suppressMessages(library(glue))

PRE = "/c/Users/Saideep/Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG = "debugging-merged-pipeline" ## copy the slug from the header
bDATE = '2023-03-08' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
[1] 0
WORK=DATA
Context
Yesterday, I observed that the Enformer CAGE outputs for ERAP2 from the merged pipeline differ from those of my prior pipeline. The quantification itself does not currently appear to be the source of the discrepancy. This post continues the debugging process.
Benchmarking against Google Colab notebook
The Enformer Google Colab notebook is a useful debugging reference because it is unlikely to contain bugs for basic Enformer runs. I therefore ran the ERAP2 reference sequence through the Colab notebook as a comparison track. The ERAP2 LCL output matches my prior pipeline results more closely than the merged results. That is, the summed prediction for a single haplotype from my run and for the reference sequence in Colab are both ~130, as opposed to ~50 for the merged pipeline.
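The comparison above boils down to summing one output track over a set of bins and checking how far apart the totals are. Below is a minimal sketch of that check. The function names, the track index, and the bin window are hypothetical; this post does not record which bins or track index were actually summed. Only the two totals (~130 and ~50) come from the runs described here.

```python
def summed_track(predictions, track_index, bins):
    """Sum a single output track over the selected bins.

    predictions : list of rows, one per output bin, each a list of track values
    track_index : column of the track of interest (hypothetical here)
    bins        : iterable of bin indices to include (hypothetical here)
    """
    return sum(predictions[b][track_index] for b in bins)


def relative_difference(a, b):
    """Absolute difference scaled by the larger magnitude (0 = identical)."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 0.0
    return abs(a - b) / largest


if __name__ == "__main__":
    # Totals reported in this post: ~130 (prior pipeline / Colab) vs ~50 (merged).
    print(round(relative_difference(130.0, 50.0), 3))
```

A relative difference this large is well beyond what numerical noise would produce between runs on the same sequence, which is why the input sequence itself became the next suspect.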
Debugging merged pipeline outputs
In response to the above, today involved a long debugging process with Temi. The process is outlined in /projects/covid-ct/imlab/users/saideep/enformer_all_geuvadis/other/test_predictions.ipynb. Because the process was long, full documentation will be completed tomorrow. In summary, however, there appears to have been a misalignment when reading the reference genome, which caused the sequence mutations to be placed 1 bp off and resulted in a heavily mutated input sequence. We confirmed that the reference and VCF files do indeed match in terms of reference allele, and the quantification scheme does appear to be accurate. For the future, extra checks that the reference sequence matches the REF allele of each VCF variant before mutations are made are highly recommended. If this check passes for all variants, it is extremely unlikely that there is an issue with the creation of the personalized sequence.
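The recommended REF-allele check can be sketched as follows. This is an illustration only, not the code from the merged pipeline: the function names and the `(pos, ref, alt)` tuple layout are assumptions, and it handles SNPs only. It relies on the fact that VCF `POS` is 1-based, so the index into the extracted window is `pos - region_start` when `region_start` is the 1-based coordinate of the window's first base.

```python
def check_ref_alleles(ref_seq, region_start, variants):
    """Return variants whose REF allele disagrees with the reference window.

    ref_seq      : reference bases for the extracted window (str)
    region_start : 1-based genomic coordinate of ref_seq[0] (assumed convention)
    variants     : iterable of (pos, ref, alt) tuples, pos is 1-based VCF POS
    """
    mismatches = []
    for pos, ref, alt in variants:
        offset = pos - region_start  # 0-based index into ref_seq
        if offset < 0 or offset + len(ref) > len(ref_seq):
            observed = ""  # variant falls outside the window
        else:
            observed = ref_seq[offset:offset + len(ref)].upper()
        if observed != ref.upper():
            mismatches.append((pos, ref, observed))
    return mismatches


def apply_snps(ref_seq, region_start, variants):
    """Substitute SNP ALT alleles, refusing to run if any REF allele mismatches."""
    mismatches = check_ref_alleles(ref_seq, region_start, variants)
    if mismatches:
        raise ValueError(f"{len(mismatches)} REF mismatches, e.g. {mismatches[0]}")
    seq = list(ref_seq)
    for pos, ref, alt in variants:
        if len(ref) == 1 and len(alt) == 1:  # SNPs only in this sketch
            seq[pos - region_start] = alt
    return "".join(seq)


if __name__ == "__main__":
    window = "ACGTACGT"  # toy sequence, not real ERAP2 data
    snps = [(101, "C", "T"), (104, "A", "G")]
    print(apply_snps(window, 100, snps))          # correct window start
    print(check_ref_alleles(window, 101, snps))   # window start off by 1 bp
```

With the correct window start, both toy variants pass and are applied. Shifting the start by 1 bp makes both REF alleles mismatch, which is the failure signature an off-by-one produces. Running the check before every mutation step turns a silent shift into an immediate error.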
Again, more detail to come tomorrow.