suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG = "modeling_aracena_reference_enformer" ## copy the slug from the header
bDATE = '2023-08-29' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if (!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK = DATA
Context
The goal of creating a training dataset from the Aracena et al. data is to use it for actual modeling. There are a variety of ways to do this, with the longer-term goal being to fine-tune Enformer-style models. Before doing that, however, I will try simpler approaches. The first is to take Enformer target predictions for different genomic regions and use them as input features for a model that predicts the corresponding Aracena targets. For maximum simplicity, I will start with reference predictions as inputs during training.
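To make the setup concrete, here is a minimal sketch of the "Enformer predictions as features" idea. It runs on synthetic data only. The matrix sizes, the names `X`, `Y`, and `fit_ridge`, and the choice of ridge regression are all assumptions made for illustration, since no specific model has been chosen yet. The synthetic data is zero-mean, so no intercept is fitted.

```r
set.seed(1)

# Synthetic stand-ins (assumptions, not real data):
n_regions  <- 200  # genomic regions (rows)
n_features <- 20   # Enformer reference tracks used as features (hypothetical count)
n_targets  <- 3    # Aracena targets to predict (hypothetical count)

# X: Enformer predictions per region; Y: Aracena targets per region
X <- matrix(rnorm(n_regions * n_features), n_regions, n_features)
B_true <- matrix(rnorm(n_features * n_targets), n_features, n_targets)
Y <- X %*% B_true +
  matrix(rnorm(n_regions * n_targets, sd = 0.5), n_regions, n_targets)

# Hold out a quarter of the regions for evaluation
train_idx <- sample(n_regions, 150)
X_train <- X[train_idx, ]
Y_train <- Y[train_idx, ]
X_test  <- X[-train_idx, ]
Y_test  <- Y[-train_idx, ]

# Closed-form ridge regression: B = (X'X + lambda * I)^(-1) X'Y
# Fits all target columns at once.
fit_ridge <- function(X, Y, lambda = 1) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, Y))
}

B_hat  <- fit_ridge(X_train, Y_train, lambda = 1)
Y_pred <- X_test %*% B_hat

# Per-target Pearson correlation on the held-out regions
test_cor <- sapply(seq_len(n_targets),
                   function(j) cor(Y_pred[, j], Y_test[, j]))
print(round(test_cor, 3))
```

With real data, `X` would hold the Enformer reference predictions and `Y` the matching Aracena measurements for the same regions. The per-target held-out correlation is one simple way to compare this baseline against later fine-tuned models.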
Source Code
---
title: "modeling_aracena_reference_enformer"
author: "Saideep Gona"
date: "2023-08-29"
format:
  html:
    code-fold: true
    code-summary: "Show the code"
execute:
  freeze: true
  warning: false
---

```{r}
#| label: Set up box storage directory
suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/Users/saideepgona/Library/CloudStorage/Box-Box/imlab-data/data-Github/Daily-Blog-Sai"

## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG = "modeling_aracena_reference_enformer" ## copy the slug from the header
bDATE = '2023-08-29' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if (!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK = DATA
```

# Context

The goal of creating a training dataset from the Aracena et al. data is to use it for actual modeling. There are a variety of ways to do this, with the longer-term goal being to fine-tune Enformer-style models. Before doing that, however, I will try simpler approaches. The first is to take Enformer target predictions for different genomic regions and use them as input features for a model that predicts the corresponding Aracena targets. For maximum simplicity, I will start with reference predictions as inputs during training.